* [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
@ 2022-06-14 12:02 Kirill A. Shutemov
  2022-06-14 12:02 ` [PATCHv7 01/14] x86/boot: Centralize __pa()/__va() definitions Kirill A. Shutemov
                   ` (19 more replies)
  0 siblings, 20 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 would have to be modified on every page acceptance.
That leads to table fragmentation, and there's only a limited number of
entries in the e820 table.

Another option is to mark such memory as usable in e820 and track
whether the range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
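
For reference, the arithmetic behind these numbers:

  one 4k page = 4096 * 8 = 32768 bits; 32768 bits * 2MiB = 64GiB covered
  4PiB / 2MiB = 2^31 bits = 2^28 bytes = 256MiB of bitmap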

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for a 4G TDX VM and ~4x faster for a 64G one.

TDX-specific code is isolated from the core of the unaccepted memory
support. That is supposed to make it easy to plug in different
implementations of unaccepted memory, such as SEV-SNP.

The tree can be found here:

https://github.com/intel/tdx.git guest-unaccepted-memory

v7:
 - Rework meminfo counter to use PageUnaccepted() and move to generic code;
 - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
 - Add Reviewed-by from David;
v6:
 - Fix load_unaligned_zeropad() on machines with unaccepted memory;
 - Clear PageUnaccepted() on merged pages, leaving it only on head;
 - Clarify error handling in allocate_e820();
 - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
 - Disable kexec at boottime instead of build conflict;
 - Rebased to tip/master;
 - Spelling fixes;
 - Add Reviewed-by from Mike and David;
v5:
 - Update comments and commit messages;
   + Explain options for unaccepted memory handling;
 - Expose amount of unaccepted memory in /proc/meminfo
 - Adjust check in page_expected_state();
 - Fix error code handling in allocate_e820();
 - Centralize __pa()/__va() definitions in the boot stub;
 - Avoid includes from the main kernel in the boot stub;
 - Use an existing hole in boot_param for unaccepted_memory, instead of adding
   to the end of the structure;
 - Extract allocate_unaccepted_memory() from allocate_e820();
 - Complain if there's unaccepted memory, but the kernel does not support it;
 - Fix vmstat counter;
 - Split up few preparatory patches;
 - Random readability adjustments;
v4:
 - PageBuddyUnaccepted() -> PageUnaccepted();
 - Use separate page_type, not shared with offline;
 - Rework interface between core-mm and arch code;
 - Adjust commit messages;
 - Ack from Mike;

Kirill A. Shutemov (14):
  x86/boot: Centralize __pa()/__va() definitions
  mm: Add support for unaccepted memory
  mm: Report unaccepted memory in meminfo
  efi/x86: Get full memory map in allocate_e820()
  x86/boot: Add infrastructure required for unaccepted memory support
  efi/x86: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  x86/mm: Reserve unaccepted memory bitmap
  x86/mm: Provide helpers for unaccepted memory
  x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  x86: Disable kexec if system has unaccepted memory
  x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
    boot stub
  x86/tdx: Refactor try_accept_one()
  x86/tdx: Add unaccepted memory support

 Documentation/x86/zero-page.rst          |   1 +
 arch/x86/Kconfig                         |   1 +
 arch/x86/boot/bitops.h                   |  40 ++++++++
 arch/x86/boot/compressed/Makefile        |   1 +
 arch/x86/boot/compressed/align.h         |  14 +++
 arch/x86/boot/compressed/bitmap.c        |  43 ++++++++
 arch/x86/boot/compressed/bitmap.h        |  49 +++++++++
 arch/x86/boot/compressed/bits.h          |  36 +++++++
 arch/x86/boot/compressed/compiler.h      |   9 ++
 arch/x86/boot/compressed/efi.h           |   1 +
 arch/x86/boot/compressed/find.c          |  54 ++++++++++
 arch/x86/boot/compressed/find.h          |  80 +++++++++++++++
 arch/x86/boot/compressed/ident_map_64.c  |   8 --
 arch/x86/boot/compressed/kaslr.c         |  35 ++++---
 arch/x86/boot/compressed/math.h          |  37 +++++++
 arch/x86/boot/compressed/mem.c           | 111 ++++++++++++++++++++
 arch/x86/boot/compressed/minmax.h        |  61 +++++++++++
 arch/x86/boot/compressed/misc.c          |   6 ++
 arch/x86/boot/compressed/misc.h          |  15 +++
 arch/x86/boot/compressed/pgtable_types.h |  25 +++++
 arch/x86/boot/compressed/sev.c           |   2 -
 arch/x86/boot/compressed/tdx.c           |  78 ++++++++++++++
 arch/x86/coco/tdx/tdx.c                  |  94 ++++++++---------
 arch/x86/include/asm/page.h              |   3 +
 arch/x86/include/asm/shared/tdx.h        |  47 +++++++++
 arch/x86/include/asm/tdx.h               |  19 ----
 arch/x86/include/asm/unaccepted_memory.h |  16 +++
 arch/x86/include/uapi/asm/bootparam.h    |   2 +-
 arch/x86/kernel/e820.c                   |  10 ++
 arch/x86/mm/Makefile                     |   2 +
 arch/x86/mm/unaccepted_memory.c          | 123 +++++++++++++++++++++++
 drivers/base/node.c                      |   7 ++
 drivers/firmware/efi/Kconfig             |  14 +++
 drivers/firmware/efi/efi.c               |   1 +
 drivers/firmware/efi/libstub/x86-stub.c  | 103 ++++++++++++++++---
 fs/proc/meminfo.c                        |   5 +
 include/linux/efi.h                      |   3 +-
 include/linux/mmzone.h                   |   1 +
 include/linux/page-flags.h               |  31 ++++++
 mm/internal.h                            |  12 +++
 mm/memblock.c                            |   9 ++
 mm/page_alloc.c                          |  96 +++++++++++++++++-
 mm/vmstat.c                              |   1 +
 43 files changed, 1191 insertions(+), 115 deletions(-)
 create mode 100644 arch/x86/boot/compressed/align.h
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/bitmap.h
 create mode 100644 arch/x86/boot/compressed/bits.h
 create mode 100644 arch/x86/boot/compressed/compiler.h
 create mode 100644 arch/x86/boot/compressed/find.c
 create mode 100644 arch/x86/boot/compressed/find.h
 create mode 100644 arch/x86/boot/compressed/math.h
 create mode 100644 arch/x86/boot/compressed/mem.c
 create mode 100644 arch/x86/boot/compressed/minmax.h
 create mode 100644 arch/x86/boot/compressed/pgtable_types.h
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 arch/x86/mm/unaccepted_memory.c

-- 
2.35.1



* [PATCHv7 01/14] x86/boot: Centralize __pa()/__va() definitions
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-23 17:37   ` Dave Hansen
  2022-06-14 12:02 ` [PATCHv7 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Mike Rapoport

Replace multiple __pa()/__va() definitions with a single one in misc.h.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/boot/compressed/ident_map_64.c | 8 --------
 arch/x86/boot/compressed/misc.h         | 9 +++++++++
 arch/x86/boot/compressed/sev.c          | 2 --
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 44c350d627c7..66a2e992c5f5 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -8,14 +8,6 @@
  * Copyright (C)      2016  Kees Cook
  */
 
-/*
- * Since we're dealing with identity mappings, physical and virtual
- * addresses are the same, so override these defines which are ultimately
- * used by the headers in misc.h.
- */
-#define __pa(x)  ((unsigned long)(x))
-#define __va(x)  ((void *)((unsigned long)(x)))
-
 /* No PAGE_TABLE_ISOLATION support needed either: */
 #undef CONFIG_PAGE_TABLE_ISOLATION
 
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 4910bf230d7b..245cf8f2a0bd 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -19,6 +19,15 @@
 /* cpu_feature_enabled() cannot be used this early */
 #define USE_EARLY_PGTABLE_L5
 
+/*
+ * Boot stub deals with identity mappings, physical and virtual addresses are
+ * the same, so override these defines.
+ *
+ * <asm/page.h> will not define them if they are already defined.
+ */
+#define __pa(x)  ((unsigned long)(x))
+#define __va(x)  ((void *)((unsigned long)(x)))
+
 #include <linux/linkage.h>
 #include <linux/screen_info.h>
 #include <linux/elf.h>
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 52f989f6acc2..730c4677e9db 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -104,9 +104,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 }
 
 #undef __init
-#undef __pa
 #define __init
-#define __pa(x)	((unsigned long)(x))
 
 #define __BOOT_COMPRESSED
 
-- 
2.35.1



* [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
  2022-06-14 12:02 ` [PATCHv7 01/14] x86/boot: Centralize __pa()/__va() definitions Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-14 12:57   ` Gupta, Pankaj
                     ` (3 more replies)
  2022-06-14 12:02 ` [PATCHv7 03/14] mm: Report unaccepted memory in meminfo Kirill A. Shutemov
                   ` (17 subsequent siblings)
  19 siblings, 4 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Mike Rapoport

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have a runtime cost once the system is booted. The downside
    is a very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on first use. It requires more
    infrastructure and changes in the page allocator to make it work,
    but it provides a good boot time.

    On-demand memory acceptance means latency spikes every time the
    kernel steps onto a new memory block. The spikes will go away once
    the workload's data set size stabilizes or all memory gets accepted.

 3. Accept all memory in the background. Introduce a thread (or several)
    that accepts memory proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires a functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt the
    user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3, and they can be implemented later based on user
demand.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - the page allocator has to accept memory on the first allocation of
    the page.

The memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
A new page type (encoded in _mapcount) -- PageUnaccepted() -- is used
to indicate that the page requires acceptance.
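
As a worked example of the encoding (values from the hunks below, plus
the assumption that PG_buddy is 0x00000080 as in <linux/page-flags.h>;
a set page type means the corresponding bit is *cleared* in _mapcount,
which is -1 for free, untyped pages), the page_expected_state() check
tolerates exactly the PageUnaccepted() type:

  untyped page:     0xffffffff | 0x00000800 == -1  -> expected state
  PageUnaccepted(): 0xfffff7ff | 0x00000800 == -1  -> expected state
  PageBuddy():      0xffffff7f | 0x00000800 != -1  -> reported as bad page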

The architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.
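
As a rough illustration only (not the x86 backend that arrives later in
the series; unaccepted_bitmap and arch_accept_range() are hypothetical
names), a 2M-granularity bitmap implementation of these hooks could
look like this:

	static unsigned long *unaccepted_bitmap;  /* one bit per 2M chunk */

	bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
	{
		unsigned long first = start / PMD_SIZE;
		unsigned long last = DIV_ROUND_UP(end, PMD_SIZE);

		/* Any set bit in [first, last) means acceptance is needed */
		return find_next_bit(unaccepted_bitmap, last, first) < last;
	}

	void accept_memory(phys_addr_t start, phys_addr_t end)
	{
		unsigned long first = start / PMD_SIZE;
		unsigned long last = DIV_ROUND_UP(end, PMD_SIZE);

		/* Platform-specific (TDX/SEV-SNP) call, hypothetical name */
		arch_accept_range(start, end);
		bitmap_clear(unaccepted_bitmap, first, last - first);
	}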

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 31 +++++++++++++
 mm/internal.h              | 12 +++++
 mm/memblock.c              |  9 ++++
 mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
 4 files changed, 139 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..444ba8f4bfa0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unaccepted	0x00000800
+
+/*
+ * Page types allowed at page_expected_state()
+ *
+ * PageUnaccepted() will get cleared in post_alloc_hook().
+ */
+#define PAGE_TYPES_EXPECTED	PG_unaccepted
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_FALSE(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)		\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+/*
+ * PageUnaccepted() indicates that the page has to be "accepted" before it can
+ * be read or written. The page allocator must call accept_page() before
+ * touching the page or returning it to the caller.
+ */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+PAGE_TYPE_OPS(Unaccepted, unaccepted)
+#else
+PAGE_TYPE_OPS_FALSE(Unaccepted)
+#endif
+
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
 extern void page_offline_begin(void);
diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..7c1a653edfc8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+						    phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index e4f03a6e8e56..a1f7f8b304d5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e008a3df0485..279c2746aaa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -122,6 +122,12 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
 
+/*
+ * Check if the page needs to be marked as PageUnaccepted().
+ * Used for the new pages added to the buddy allocator for the first time.
+ */
+#define FPI_UNACCEPTED_SLOWPATH	((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 			NULL) != NULL;
 }
 
+/*
+ * Page acceptance can be very slow. Do not call under critical locks.
+ */
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	int i;
+
+	if (!PageUnaccepted(page))
+		return;
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+	__ClearPageUnaccepted(page);
+
+	/* Assert that there is no PageUnaccepted() on tail pages */
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		for (i = 1; i < (1 << order); i++)
+			VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i);
+	}
+}
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	phys_addr_t end = start + (PAGE_SIZE << order);
+
+	return range_contains_unaccepted_memory(start, end);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_pfn;
 	struct page *buddy;
 	bool to_tail;
+	bool page_needs_acceptance = false;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
+	if (PageUnaccepted(page)) {
+		page_needs_acceptance = true;
+		__ClearPageUnaccepted(page);
+	}
+
 	while (order < MAX_ORDER - 1) {
 		if (compaction_capture(capc, page, order, migratetype)) {
 			__mod_zone_freepage_state(zone, -(1 << order),
@@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
 			del_page_from_free_list(buddy, zone, order);
+
+		/* Mark page unaccepted if any of merged pages were unaccepted */
+		if (PageUnaccepted(buddy)) {
+			page_needs_acceptance = true;
+			__ClearPageUnaccepted(buddy);
+		}
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_buddy_order(page, order);
 
+	/*
+	 * The page gets marked as PageUnaccepted() if any of the merged-in
+	 * pages is PageUnaccepted().
+	 *
+	 * New pages, just being added to the buddy allocator, do not have
+	 * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the
+	 * page is new and the page_contains_unaccepted() check is required
+	 * to determine if acceptance is required.
+	 *
+	 * Avoid calling page_contains_unaccepted() if it is known that the page
+	 * needs acceptance. It can be costly.
+	 */
+	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
+		page_needs_acceptance = page_contains_unaccepted(page, order);
+	if (page_needs_acceptance)
+		__SetPageUnaccepted(page);
+
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
 	else if (is_shuffle_order(order))
@@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	/*
+	 * The page must not be mapped to userspace and must not have
+	 * a PageType other than those listed in PAGE_TYPES_EXPECTED.
+	 *
+	 * Note, bit cleared means the page type is set.
+	 */
+	if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order)
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
 	 */
-	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+	__free_pages_ok(page, order,
+			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON |
+			FPI_UNACCEPTED_SLOWPATH);
 }
 
 #ifdef CONFIG_NUMA
@@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	/* Accept chunks smaller than page-block upfront */
+	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
+		/*
+		 * Transfer PageUnaccepted() to the newly split pages so
+		 * they can be accepted after dropping the zone lock.
+		 */
+		if (PageUnaccepted(page))
+			__SetPageUnaccepted(&page[size]);
+
 		add_to_free_list(&page[size], zone, high, migratetype);
 		set_buddy_order(&page[size], high);
 	}
@@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	kernel_unpoison_pages(page, 1 << order);
 
+	accept_page(page, order);
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * KASAN unpoisoning and memory initializion code must be
-- 
2.35.1



* [PATCHv7 03/14] mm: Report unaccepted memory in meminfo
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
  2022-06-14 12:02 ` [PATCHv7 01/14] x86/boot: Centralize __pa()/__va() definitions Kirill A. Shutemov
  2022-06-14 12:02 ` [PATCHv7 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-07-26 14:33   ` David Hildenbrand
  2022-06-14 12:02 ` [PATCHv7 04/14] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Track the amount of unaccepted memory and report it in /proc/meminfo
and in the per-node meminfo.
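
A worked example of the accounting below: freeing an accepted order-9
page that merges with an unaccepted order-9 buddy produces an order-10
PageUnaccepted() page. nr_unaccepted accumulates the 512 pages already
counted for the buddy, so the final adjustment is (1 << 10) - 512 = +512
and NR_UNACCEPTED ends up at 1024 pages -- the whole merged block.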

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/base/node.c    | 7 +++++++
 fs/proc/meminfo.c      | 5 +++++
 include/linux/mmzone.h | 1 +
 mm/page_alloc.c        | 9 ++++++++-
 mm/vmstat.c            | 1 +
 5 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ac6376ef7a1..fc7cf4d91eb6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -446,6 +446,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d ShmemPmdMapped: %8lu kB\n"
 			     "Node %d FileHugePages: %8lu kB\n"
 			     "Node %d FilePmdMapped: %8lu kB\n"
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     "Node %d UnacceptedPages: %8lu kB\n"
 #endif
 			     ,
 			     nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -474,6 +477,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
 			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     ,
+			     nid, K(node_page_state(pgdat, NR_UNACCEPTED))
 #endif
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6e89f0e2fd20..796544e50365 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -153,6 +153,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	show_val_kb(m, "UnacceptedPages:",
+		    global_node_page_state(NR_UNACCEPTED));
+#endif
+
 	hugetlb_report_meminfo(m);
 
 	arch_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..aa08cd7eaaf5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -212,6 +212,7 @@ enum node_stat_item {
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
 	NR_KERNEL_STACK_KB,	/* measured in KiB */
+	NR_UNACCEPTED,
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 279c2746aaa8..6316d695a567 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1012,6 +1012,7 @@ static void accept_page(struct page *page, unsigned int order)
 
 	accept_memory(start, start + (PAGE_SIZE << order));
 	__ClearPageUnaccepted(page);
+	mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, -(1 << order));
 
 	/* Assert that there is no PageUnaccepted() on tail pages */
 	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
@@ -1063,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
 	struct page *buddy;
 	bool to_tail;
 	bool page_needs_acceptance = false;
+	int nr_unaccepted = 0;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1076,6 +1078,7 @@ static inline void __free_one_page(struct page *page,
 
 	if (PageUnaccepted(page)) {
 		page_needs_acceptance = true;
+		nr_unaccepted += 1 << order;
 		__ClearPageUnaccepted(page);
 	}
 
@@ -1117,6 +1120,7 @@ static inline void __free_one_page(struct page *page,
 		/* Mark page unaccepted if any of merged pages were unaccepted */
 		if (PageUnaccepted(buddy)) {
 			page_needs_acceptance = true;
+			nr_unaccepted += 1 << order;
 			__ClearPageUnaccepted(buddy);
 		}
 
@@ -1143,8 +1147,11 @@ static inline void __free_one_page(struct page *page,
 	 */
 	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
 		page_needs_acceptance = page_contains_unaccepted(page, order);
-	if (page_needs_acceptance)
+	if (page_needs_acceptance) {
 		__SetPageUnaccepted(page);
+		__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED,
+				    (1 << order) - nr_unaccepted);
+	}
 
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..4e12d22f1e04 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1236,6 +1236,7 @@ const char * const vmstat_text[] = {
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
 	"nr_kernel_stack",
+	"nr_unaccepted",
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	"nr_shadow_call_stack",
 #endif
-- 
2.35.1



* [PATCHv7 04/14] efi/x86: Get full memory map in allocate_e820()
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 03/14] mm: Report unaccepted memory in meminfo Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-07-25 13:02   ` Borislav Petkov
  2022-06-14 12:02 ` [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support Kirill A. Shutemov
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Currently allocate_e820() is only interested in the size of the memory
map and the size of each memory descriptor, which it uses to determine
how many e820 entries the kernel needs.

UEFI Specification version 2.9 introduces a new memory type --
unaccepted memory. To track unaccepted memory, the kernel needs to
allocate a bitmap. The size of the bitmap depends on the maximum
physical address present in the system, and a full memory map is
required to find that maximum address.

Modify allocate_e820() to get a full memory map.

This is preparation for the next patch that implements handling of
unaccepted memory in EFI stub.
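
For illustration (a sketch only, not code from this patch; it assumes
the pointer-based struct efi_boot_memmap layout used in the hunk below),
the bitmap size can be derived from the full map roughly like this:

	static unsigned long bitmap_size_from_map(struct efi_boot_memmap *map)
	{
		efi_memory_desc_t *d;
		u64 max_addr = 0;
		int i;

		for (i = 0; i < *map->map_size / *map->desc_size; i++) {
			d = (void *)*map->map + (i * *map->desc_size);
			max_addr = max(max_addr,
				       d->phys_addr + d->num_pages * PAGE_SIZE);
		}

		/* One bit per 2M chunk, rounded up to whole bytes */
		return DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
	}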

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/firmware/efi/libstub/x86-stub.c | 28 +++++++++++--------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 05ae8bcc9d67..504955368934 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -672,31 +672,27 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
 }
 
 static efi_status_t allocate_e820(struct boot_params *params,
+				  struct efi_boot_memmap *map,
 				  struct setup_data **e820ext,
 				  u32 *e820ext_size)
 {
-	unsigned long map_size, desc_size, map_key;
 	efi_status_t status;
-	__u32 nr_desc, desc_version;
+	__u32 nr_desc;
 
-	/* Only need the size of the mem map and size of each mem descriptor */
-	map_size = 0;
-	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
-			     &desc_size, &desc_version);
-	if (status != EFI_BUFFER_TOO_SMALL)
-		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
-
-	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
+	status = efi_get_memory_map(map);
+	if (status != EFI_SUCCESS)
+		return status;
 
-	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
-		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+	nr_desc = *map->map_size / *map->desc_size;
+	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+			EFI_MMAP_NR_SLACK_SLOTS;
 
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
-		if (status != EFI_SUCCESS)
-			return status;
 	}
 
-	return EFI_SUCCESS;
+	efi_bs_call(free_pool, *map->map);
+	return status;
 }
 
 struct exit_boot_struct {
@@ -745,7 +741,7 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
 	priv.boot_params	= boot_params;
 	priv.efi		= &boot_params->efi_info;
 
-	status = allocate_e820(boot_params, &e820ext, &e820ext_size);
+	status = allocate_e820(boot_params, &map, &e820ext, &e820ext_size);
 	if (status != EFI_SUCCESS)
 		return status;
 
-- 
2.35.1



* [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 04/14] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-15 10:19   ` Peter Zijlstra
  2022-07-25 21:33   ` Borislav Petkov
  2022-06-14 12:02 ` [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Pull functionality from the main kernel headers and lib/ that is
required for unaccepted memory support.

This is a preparatory patch. The users of the functionality will come
in the following patches.
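
As a usage sketch of one of the pulled-in helpers (accept_chunk(),
bitmap and nbits are hypothetical here; later patches walk the
unaccepted-memory bitmap with the same pattern):

	unsigned long start = 0, end;

	/* Each iteration yields a maximal run of set bits: [start, end) */
	for_each_set_bitrange_from(start, end, bitmap, nbits)
		accept_chunk(start, end);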

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/bitops.h                   | 40 ++++++++++++
 arch/x86/boot/compressed/align.h         | 14 +++++
 arch/x86/boot/compressed/bitmap.c        | 43 +++++++++++++
 arch/x86/boot/compressed/bitmap.h        | 49 +++++++++++++++
 arch/x86/boot/compressed/bits.h          | 36 +++++++++++
 arch/x86/boot/compressed/compiler.h      |  9 +++
 arch/x86/boot/compressed/find.c          | 54 ++++++++++++++++
 arch/x86/boot/compressed/find.h          | 80 ++++++++++++++++++++++++
 arch/x86/boot/compressed/math.h          | 37 +++++++++++
 arch/x86/boot/compressed/minmax.h        | 61 ++++++++++++++++++
 arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++
 11 files changed, 448 insertions(+)
 create mode 100644 arch/x86/boot/compressed/align.h
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/bitmap.h
 create mode 100644 arch/x86/boot/compressed/bits.h
 create mode 100644 arch/x86/boot/compressed/compiler.h
 create mode 100644 arch/x86/boot/compressed/find.c
 create mode 100644 arch/x86/boot/compressed/find.h
 create mode 100644 arch/x86/boot/compressed/math.h
 create mode 100644 arch/x86/boot/compressed/minmax.h
 create mode 100644 arch/x86/boot/compressed/pgtable_types.h

diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h
index 02e1dea11d94..61eb820ee402 100644
--- a/arch/x86/boot/bitops.h
+++ b/arch/x86/boot/bitops.h
@@ -41,4 +41,44 @@ static inline void set_bit(int nr, void *addr)
 	asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr));
 }
 
+static __always_inline void __set_bit(long nr, volatile unsigned long *addr)
+{
+	asm volatile(__ASM_SIZE(bts) " %1,%0" : : "m" (*(volatile long *) addr),
+		     "Ir" (nr) : "memory");
+}
+
+static __always_inline void __clear_bit(long nr, volatile unsigned long *addr)
+{
+	asm volatile(__ASM_SIZE(btr) " %1,%0" : : "m" (*(volatile long *) addr),
+		     "Ir" (nr) : "memory");
+}
+
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static __always_inline unsigned long __ffs(unsigned long word)
+{
+	asm("rep; bsf %1,%0"
+		: "=r" (word)
+		: "rm" (word));
+	return word;
+}
+
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+static __always_inline unsigned long ffz(unsigned long word)
+{
+	asm("rep; bsf %1,%0"
+		: "=r" (word)
+		: "r" (~word));
+	return word;
+}
+
 #endif /* BOOT_BITOPS_H */
diff --git a/arch/x86/boot/compressed/align.h b/arch/x86/boot/compressed/align.h
new file mode 100644
index 000000000000..7ccabbc5d1b8
--- /dev/null
+++ b/arch/x86/boot/compressed/align.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_ALIGN_H
+#define BOOT_ALIGN_H
+#define _LINUX_ALIGN_H /* Inhibit inclusion of <linux/align.h> */
+
+/* @a is a power of 2 value */
+#define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
+#define ALIGN_DOWN(x, a)	__ALIGN_KERNEL((x) - ((a) - 1), (a))
+#define __ALIGN_MASK(x, mask)	__ALIGN_KERNEL_MASK((x), (mask))
+#define PTR_ALIGN(p, a)		((typeof(p))ALIGN((unsigned long)(p), (a)))
+#define PTR_ALIGN_DOWN(p, a)	((typeof(p))ALIGN_DOWN((unsigned long)(p), (a)))
+#define IS_ALIGNED(x, a)		(((x) & ((typeof(x))(a) - 1)) == 0)
+
+#endif
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
new file mode 100644
index 000000000000..789ecadeb521
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "bitmap.h"
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_set >= 0) {
+		*p |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_LONG;
+		mask_to_set = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+		*p |= mask_to_set;
+	}
+}
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
diff --git a/arch/x86/boot/compressed/bitmap.h b/arch/x86/boot/compressed/bitmap.h
new file mode 100644
index 000000000000..35357f5feda2
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITMAP_H
+#define BOOT_BITMAP_H
+#define __LINUX_BITMAP_H /* Inhibit inclusion of <linux/bitmap.h> */
+
+#include "../bitops.h"
+#include "../string.h"
+#include "align.h"
+
+#define BITMAP_MEM_ALIGNMENT 8
+#define BITMAP_MEM_MASK (BITMAP_MEM_ALIGNMENT - 1)
+
+#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
+#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
+
+#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len);
+void __bitmap_clear(unsigned long *map, unsigned int start, int len);
+
+static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
+		unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__set_bit(start, map);
+	else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+		 __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+		memset((char *)map + start / 8, 0xff, nbits / 8);
+	else
+		__bitmap_set(map, start, nbits);
+}
+
+static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
+		unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+		 __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
+#endif
diff --git a/arch/x86/boot/compressed/bits.h b/arch/x86/boot/compressed/bits.h
new file mode 100644
index 000000000000..b0ffa007ee19
--- /dev/null
+++ b/arch/x86/boot/compressed/bits.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITS_H
+#define BOOT_BITS_H
+#define __LINUX_BITS_H /* Inhibit inclusion of <linux/bits.h> */
+
+#ifdef __ASSEMBLY__
+#define _AC(X,Y)	X
+#define _AT(T,X)	X
+#else
+#define __AC(X,Y)	(X##Y)
+#define _AC(X,Y)	__AC(X,Y)
+#define _AT(T,X)	((T)(X))
+#endif
+
+#define _UL(x)		(_AC(x, UL))
+#define _ULL(x)		(_AC(x, ULL))
+#define UL(x)		(_UL(x))
+#define ULL(x)		(_ULL(x))
+
+#define BIT(nr)			(UL(1) << (nr))
+#define BIT_ULL(nr)		(ULL(1) << (nr))
+#define BIT_MASK(nr)		(UL(1) << ((nr) % BITS_PER_LONG))
+#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
+#define BIT_ULL_MASK(nr)	(ULL(1) << ((nr) % BITS_PER_LONG_LONG))
+#define BIT_ULL_WORD(nr)	((nr) / BITS_PER_LONG_LONG)
+#define BITS_PER_BYTE		8
+
+#define GENMASK(h, l) \
+	(((~UL(0)) - (UL(1) << (l)) + 1) & \
+	 (~UL(0) >> (BITS_PER_LONG - 1 - (h))))
+
+#define GENMASK_ULL(h, l) \
+	(((~ULL(0)) - (ULL(1) << (l)) + 1) & \
+	 (~ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
+
+#endif
diff --git a/arch/x86/boot/compressed/compiler.h b/arch/x86/boot/compressed/compiler.h
new file mode 100644
index 000000000000..452c4c0844b9
--- /dev/null
+++ b/arch/x86/boot/compressed/compiler.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_COMPILER_H
+#define BOOT_COMPILER_H
+#define __LINUX_COMPILER_H /* Inhibit inclusion of <linux/compiler.h> */
+
+# define likely(x)	__builtin_expect(!!(x), 1)
+# define unlikely(x)	__builtin_expect(!!(x), 0)
+
+#endif
diff --git a/arch/x86/boot/compressed/find.c b/arch/x86/boot/compressed/find.c
new file mode 100644
index 000000000000..839be91aae52
--- /dev/null
+++ b/arch/x86/boot/compressed/find.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "bitmap.h"
+#include "find.h"
+#include "math.h"
+#include "minmax.h"
+
+static __always_inline unsigned long swab(const unsigned long y)
+{
+#if __BITS_PER_LONG == 64
+	return __builtin_bswap64(y);
+#else /* __BITS_PER_LONG == 32 */
+	return __builtin_bswap32(y);
+#endif
+}
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert, unsigned long le)
+{
+	unsigned long tmp, mask;
+
+	if (unlikely(start >= nbits))
+		return nbits;
+
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
+
+	/* Handle 1st word. */
+	mask = BITMAP_FIRST_WORD_MASK(start);
+	if (le)
+		mask = swab(mask);
+
+	tmp &= mask;
+
+	start = round_down(start, BITS_PER_LONG);
+
+	while (!tmp) {
+		start += BITS_PER_LONG;
+		if (start >= nbits)
+			return nbits;
+
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
+	}
+
+	if (le)
+		tmp = swab(tmp);
+
+	return min(start + __ffs(tmp), nbits);
+}
diff --git a/arch/x86/boot/compressed/find.h b/arch/x86/boot/compressed/find.h
new file mode 100644
index 000000000000..799176972ebc
--- /dev/null
+++ b/arch/x86/boot/compressed/find.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_FIND_H
+#define BOOT_FIND_H
+#define __LINUX_FIND_H /* Inhibit inclusion of <linux/find.h> */
+
+#include "../bitops.h"
+#include "align.h"
+#include "bits.h"
+#include "compiler.h"
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert, unsigned long le);
+
+/**
+ * find_next_bit - find the next set bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
+			    unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+}
+
+/**
+ * find_next_zero_bit - find the next cleared bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number of the next zero bit
+ * If no bits are zero, returns @size.
+ */
+static inline
+unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
+				 unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+}
+
+/**
+ * for_each_set_bitrange_from - iterate over all set bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first set bit); must be initialized
+ * @e: bit offset of end of current bitrange (first unset bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_set_bitrange_from(b, e, addr, size)		\
+	for ((b) = find_next_bit((addr), (size), (b)),		\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1);	\
+	     (b) < (size);					\
+	     (b) = find_next_bit((addr), (size), (e) + 1),	\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1))
+#endif
diff --git a/arch/x86/boot/compressed/math.h b/arch/x86/boot/compressed/math.h
new file mode 100644
index 000000000000..f7eede84bbc2
--- /dev/null
+++ b/arch/x86/boot/compressed/math.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MATH_H
+#define BOOT_MATH_H
+#define __LINUX_MATH_H /* Inhibit inclusion of <linux/math.h> */
+
+/*
+ *
+ * This looks more complex than it should be. But we need to
+ * get the type for the ~ right in round_down (it needs to be
+ * as wide as the result!), and we want to evaluate the macro
+ * arguments just once each.
+ */
+#define __round_mask(x, y) ((__typeof__(x))((y)-1))
+
+/**
+ * round_up - round up to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round up to (must be a power of 2)
+ *
+ * Rounds @x up to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding up, use roundup() below.
+ */
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
+
+/**
+ * round_down - round down to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round down to (must be a power of 2)
+ *
+ * Rounds @x down to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding down, use rounddown() below.
+ */
+#define round_down(x, y) ((x) & ~__round_mask(x, y))
+
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+
+#endif
diff --git a/arch/x86/boot/compressed/minmax.h b/arch/x86/boot/compressed/minmax.h
new file mode 100644
index 000000000000..4efd05673260
--- /dev/null
+++ b/arch/x86/boot/compressed/minmax.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MINMAX_H
+#define BOOT_MINMAX_H
+#define __LINUX_MINMAX_H /* Inhibit inclusion of <linux/minmax.h> */
+
+/*
+ * This returns a constant expression while determining if an argument is
+ * a constant expression, most importantly without evaluating the argument.
+ * Glory to Martin Uecker <Martin.Uecker@med.uni-goettingen.de>
+ */
+#define __is_constexpr(x) \
+	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
+
+/*
+ * min()/max()/clamp() macros must accomplish three things:
+ *
+ * - avoid multiple evaluations of the arguments (so side-effects like
+ *   "x++" happen only once) when non-constant.
+ * - perform strict type-checking (to generate warnings instead of
+ *   nasty runtime surprises). See the "unnecessary" pointer comparison
+ *   in __typecheck().
+ * - retain result as a constant expressions when called with only
+ *   constant expressions (to avoid tripping VLA warnings in stack
+ *   allocation usage).
+ */
+#define __typecheck(x, y) \
+	(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
+
+#define __no_side_effects(x, y) \
+		(__is_constexpr(x) && __is_constexpr(y))
+
+#define __safe_cmp(x, y) \
+		(__typecheck(x, y) && __no_side_effects(x, y))
+
+#define __cmp(x, y, op)	((x) op (y) ? (x) : (y))
+
+#define __cmp_once(x, y, unique_x, unique_y, op) ({	\
+		typeof(x) unique_x = (x);		\
+		typeof(y) unique_y = (y);		\
+		__cmp(unique_x, unique_y, op); })
+
+#define __careful_cmp(x, y, op) \
+	__builtin_choose_expr(__safe_cmp(x, y), \
+		__cmp(x, y, op), \
+		__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
+
+/**
+ * min - return minimum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define min(x, y)	__careful_cmp(x, y, <)
+
+/**
+ * max - return maximum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define max(x, y)	__careful_cmp(x, y, >)
+
+#endif
diff --git a/arch/x86/boot/compressed/pgtable_types.h b/arch/x86/boot/compressed/pgtable_types.h
new file mode 100644
index 000000000000..8f1d87a69efc
--- /dev/null
+++ b/arch/x86/boot/compressed/pgtable_types.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define _ASM_X86_PGTABLE_DEFS_H /* Inhibit inclusion of <asm/pgtable_types.h> */
+
+#define PAGE_SHIFT	12
+
+#ifdef CONFIG_X86_64
+#define PTE_SHIFT	9
+#elif defined CONFIG_X86_PAE
+#define PTE_SHIFT	9
+#else /* 2-level */
+#define PTE_SHIFT	10
+#endif
+
+enum pg_level {
+	PG_LEVEL_NONE,
+	PG_LEVEL_4K,
+	PG_LEVEL_2M,
+	PG_LEVEL_1G,
+	PG_LEVEL_512G,
+	PG_LEVEL_NUM
+};
+
+#endif
-- 
2.35.1



* [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-22 19:58   ` Dave Hansen
  2022-07-26  8:35   ` Borislav Petkov
  2022-06-14 12:02 ` [PATCHv7 07/14] x86/boot/compressed: Handle " Kirill A. Shutemov
                   ` (13 subsequent siblings)
  19 siblings, 2 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 would have to be modified on every page acceptance.
That leads to table fragmentation, and there's only a limited number of
entries in the e820 table.

Another option is to mark such memory as usable in e820 and track
whether the range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via boot_params. allocate_e820() allocates the bitmap if
unaccepted memory is present, sized according to the maximum address in
the memory map.

The same boot_params.unaccepted_memory can be used to pass the bitmap
between two kernels on kexec, but the use-case is not yet implemented.

The implementation requires some basic helpers in the boot stub. They
are provided by linux/ includes in the main kernel image, but are not
present in the boot stub. Create a copy of the required functionality
in the boot stub.
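
For orientation, the stub side boils down to the following loop
(simplified, not the literal hunk; 'm', nr_desc and desc_size stand for
the fetched memory map and its dimensions):

	for (i = 0; i < nr_desc; i++) {
		efi_memory_desc_t *d = (void *)m + (i * desc_size);

		if (d->type != EFI_UNACCEPTED_MEMORY)
			continue;

		process_unaccepted_memory(params, d->phys_addr,
					  d->phys_addr + d->num_pages * PAGE_SIZE);
	}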

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/zero-page.rst          |  1 +
 arch/x86/boot/compressed/Makefile        |  1 +
 arch/x86/boot/compressed/mem.c           | 68 ++++++++++++++++++++++++
 arch/x86/include/asm/unaccepted_memory.h | 10 ++++
 arch/x86/include/uapi/asm/bootparam.h    |  2 +-
 drivers/firmware/efi/Kconfig             | 14 +++++
 drivers/firmware/efi/efi.c               |  1 +
 drivers/firmware/efi/libstub/x86-stub.c  | 68 ++++++++++++++++++++++++
 include/linux/efi.h                      |  3 +-
 9 files changed, 166 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/boot/compressed/mem.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
index 45aa9cceb4f1..f21905e61ade 100644
--- a/Documentation/x86/zero-page.rst
+++ b/Documentation/x86/zero-page.rst
@@ -20,6 +20,7 @@ Offset/Size	Proto	Name			Meaning
 060/010		ALL	ist_info		Intel SpeedStep (IST) BIOS support information
 						(struct ist_info)
 070/008		ALL	acpi_rsdp_addr		Physical address of ACPI RSDP table
+078/008		ALL	unaccepted_memory	Bitmap of unaccepted memory (1bit == 2M)
 080/010		ALL	hd0_info		hd0 disk parameter, OBSOLETE!!
 090/010		ALL	hd1_info		hd1 disk parameter, OBSOLETE!!
 0A0/010		ALL	sys_desc_table		System description table (struct sys_desc_table),
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 19e1905dcbf6..67732478323f 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -102,6 +102,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
new file mode 100644
index 000000000000..415df0d3bc81
--- /dev/null
+++ b/arch/x86/boot/compressed/mem.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../cpuflags.h"
+#include "bitmap.h"
+#include "error.h"
+#include "math.h"
+
+#define PMD_SHIFT	21
+#define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
+#define PMD_MASK	(~(PMD_SIZE - 1))
+
+static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-acceptance call goes here */
+	error("Cannot accept memory");
+}
+
+/*
+ * The accepted memory bitmap only works at PMD_SIZE granularity. If a request
+ * comes in to mark memory as unaccepted which is not PMD_SIZE-aligned, simply
+ * accept the memory now since it can not be *marked* as unaccepted.
+ */
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
+{
+	/*
+	 * Accept small regions that might not be able to be represented
+	 * in the bitmap.  This is a bit imprecise and may accept some
+	 * areas that could have been represented in the bitmap instead.
+	 *
+	 * Consider case like this:
+	 *
+	 * | 4k | 2044k |    2048k   |
+	 * ^ 0x0        ^ 2MB        ^ 4MB
+	 *
+	 * All memory in the range is unaccepted, except for the first 4k.
+	 * The second 2M can be represented in the bitmap, but the kernel
+	 * accepts it right away. The imprecision makes the code simpler by
+	 * ensuring that at least one bit will be set in the bitmap below.
+	 */
+	if (end - start < 2 * PMD_SIZE) {
+		__accept_memory(start, end);
+		return;
+	}
+
+	/*
+	 * No matter how the start and end are aligned, at least one unaccepted
+	 * PMD_SIZE area will remain.
+	 */
+
+	/* Immediately accept a <PMD_SIZE piece at the start: */
+	if (start & ~PMD_MASK) {
+		__accept_memory(start, round_up(start, PMD_SIZE));
+		start = round_up(start, PMD_SIZE);
+	}
+
+	/* Immediately accept a <PMD_SIZE piece at the end: */
+	if (end & ~PMD_MASK) {
+		__accept_memory(round_down(end, PMD_SIZE), end);
+		end = round_down(end, PMD_SIZE);
+	}
+
+	/*
+	 * 'start' and 'end' are now both PMD-aligned.
+	 * Record the range as being unaccepted:
+	 */
+	bitmap_set((unsigned long *)params->unaccepted_memory,
+		   start / PMD_SIZE, (end - start) / PMD_SIZE);
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..df0736d32858
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+struct boot_params;
+
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
+
+#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index bea5cdcdf532..c62f6ee4b6f0 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -180,7 +180,7 @@ struct boot_params {
 	__u64  tboot_addr;				/* 0x058 */
 	struct ist_info ist_info;			/* 0x060 */
 	__u64 acpi_rsdp_addr;				/* 0x070 */
-	__u8  _pad3[8];					/* 0x078 */
+	__u64 unaccepted_memory;			/* 0x078 */
 	__u8  hd0_info[16];	/* obsolete! */		/* 0x080 */
 	__u8  hd1_info[16];	/* obsolete! */		/* 0x090 */
 	struct sys_desc_table sys_desc_table; /* obsolete! */	/* 0x0a0 */
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 7aa4717cdcac..e1270beff4dc 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -305,6 +305,20 @@ config EFI_COCO_SECRET
 	  virt/coco/efi_secret module to access the secrets, which in turn
 	  allows userspace programs to access the injected secrets.
 
+config UNACCEPTED_MEMORY
+	bool
+	depends on EFI_STUB
+	help
+	   Some Virtual Machine platforms, such as Intel TDX, require
+	   some memory to be "accepted" by the guest before it can be used.
+	   This mechanism helps prevent malicious hosts from making changes
+	   to guest memory.
+
+	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
+
+	   This option adds support for unaccepted memory and makes such memory
+	   usable by the kernel.
+
 config EFI_EMBEDDED_FIRMWARE
 	bool
 	select CRYPTO_LIB_SHA256
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 860534bcfdac..34d1ca870420 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -758,6 +758,7 @@ static __initdata char memory_type_name[][13] = {
 	"MMIO Port",
 	"PAL Code",
 	"Persistent",
+	"Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 504955368934..b91c89100b2d 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -15,6 +15,7 @@
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/boot.h>
+#include <asm/unaccepted_memory.h>
 
 #include "efistub.h"
 
@@ -607,6 +608,17 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 			e820_type = E820_TYPE_PMEM;
 			break;
 
+		case EFI_UNACCEPTED_MEMORY:
+			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
+				efi_warn_once("The system has unaccepted memory,"
+					     " but kernel does not support it\n");
+				efi_warn_once("Consider enabling CONFIG_UNACCEPTED_MEMORY\n");
+				continue;
+			}
+			e820_type = E820_TYPE_RAM;
+			process_unaccepted_memory(params, d->phys_addr,
+						  d->phys_addr + PAGE_SIZE * d->num_pages);
+			break;
 		default:
 			continue;
 		}
@@ -671,6 +683,59 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
 	return status;
 }
 
+static efi_status_t allocate_unaccepted_memory(struct boot_params *params,
+					       __u32 nr_desc,
+					       struct efi_boot_memmap *map)
+{
+	unsigned long *mem = NULL;
+	u64 size, max_addr = 0;
+	efi_status_t status;
+	bool found = false;
+	int i;
+
+	/* Check if there's any unaccepted memory and find the max address */
+	for (i = 0; i < nr_desc; i++) {
+		efi_memory_desc_t *d;
+
+		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
+		if (d->type == EFI_UNACCEPTED_MEMORY)
+			found = true;
+		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
+			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
+	}
+
+	if (!found) {
+		params->unaccepted_memory = 0;
+		return EFI_SUCCESS;
+	}
+
+	/*
+	 * If unaccepted memory is present allocate a bitmap to track what
+	 * memory has to be accepted before access.
+	 *
+	 * One bit in the bitmap represents 2MiB in the address space:
+	 * A 4k bitmap can track 64GiB of physical address space.
+	 *
+	 * In the worst case scenario -- a huge hole in the middle of the
+	 * address space -- it needs 256MiB to handle 4PiB of the address
+	 * space.
+	 *
+	 * TODO: handle situation if params->unaccepted_memory is already set.
+	 * It's required to deal with kexec.
+	 *
+	 * The bitmap will be populated in setup_e820() according to the memory
+	 * map after efi_exit_boot_services().
+	 */
+	size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+	status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
+	if (status == EFI_SUCCESS) {
+		memset(mem, 0, size);
+		params->unaccepted_memory = (unsigned long)mem;
+	}
+
+	return status;
+}
+
 static efi_status_t allocate_e820(struct boot_params *params,
 				  struct efi_boot_memmap *map,
 				  struct setup_data **e820ext,
@@ -691,6 +756,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
 	}
 
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
+		status = allocate_unaccepted_memory(params, nr_desc, map);
+
 	efi_bs_call(free_pool, *map->map);
 	return status;
 }
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 7d9b0bb47eb3..9c2fa94f2f93 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
 #define EFI_PERSISTENT_MEMORY		14
-#define EFI_MAX_MEMORY_TYPE		15
+#define EFI_UNACCEPTED_MEMORY		15
+#define EFI_MAX_MEMORY_TYPE		16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 07/14] x86/boot/compressed: Handle unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-14 12:02 ` [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

The firmware will pre-accept the memory used to run the stub. But, the
stub is responsible for accepting the memory into which it decompresses
the main kernel. Accept memory just before decompression starts.

The stub is also responsible for choosing a physical address in which to
place the decompressed kernel image. The KASLR mechanism will randomize
this physical address. Since the unaccepted memory region is relatively
small, KASLR would be quite ineffective if it only used the pre-accepted
area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes among the
entire physical address space by also including EFI_UNACCEPTED_MEMORY.
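
The resulting region filter can be sketched as follows (illustrative;
it mirrors the memory_type_is_free() helper added in the hunk below):

	/* Only guaranteed-free region types are KASLR candidates */
	if (md->type != EFI_CONVENTIONAL_MEMORY &&
	    !(IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
	      md->type == EFI_UNACCEPTED_MEMORY))
		continue;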

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile        |  2 +-
 arch/x86/boot/compressed/efi.h           |  1 +
 arch/x86/boot/compressed/kaslr.c         | 35 ++++++++++++++++--------
 arch/x86/boot/compressed/mem.c           | 18 ++++++++++++
 arch/x86/boot/compressed/misc.c          |  6 ++++
 arch/x86/boot/compressed/misc.h          |  6 ++++
 arch/x86/include/asm/unaccepted_memory.h |  2 ++
 7 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 67732478323f..85a631f5cdff 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -102,7 +102,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
-vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
index 7db2f41b54cd..cf475243b6d5 100644
--- a/arch/x86/boot/compressed/efi.h
+++ b/arch/x86/boot/compressed/efi.h
@@ -32,6 +32,7 @@ typedef	struct {
 } efi_table_hdr_t;
 
 #define EFI_CONVENTIONAL_MEMORY		 7
+#define EFI_UNACCEPTED_MEMORY		15
 
 #define EFI_MEMORY_MORE_RELIABLE \
 				((u64)0x0000000000010000ULL)	/* higher reliability */
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 4a3f223973f4..a42264b0b2f5 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -671,6 +671,28 @@ static bool process_mem_region(struct mem_vector *region,
 }
 
 #ifdef CONFIG_EFI
+
+/*
+ * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
+ * guaranteed to be free.
+ *
+ * It is more conservative in picking free memory than the EFI spec allows:
+ *
+ * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
+ * and thus available to place the kernel image into, but in practice there's
+ * firmware where using that memory leads to crashes.
+ */
+static inline bool memory_type_is_free(efi_memory_desc_t *md)
+{
+	if (md->type == EFI_CONVENTIONAL_MEMORY)
+		return true;
+
+	if (md->type == EFI_UNACCEPTED_MEMORY)
+		return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
+
+	return false;
+}
+
 /*
  * Returns true if we processed the EFI memmap, which we prefer over the E820
  * table if it is available.
@@ -715,18 +737,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
 	for (i = 0; i < nr_desc; i++) {
 		md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
 
-		/*
-		 * Here we are more conservative in picking free memory than
-		 * the EFI spec allows:
-		 *
-		 * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
-		 * free memory and thus available to place the kernel image into,
-		 * but in practice there's firmware where using that memory leads
-		 * to crashes.
-		 *
-		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
-		 */
-		if (md->type != EFI_CONVENTIONAL_MEMORY)
+		if (!memory_type_is_free(md))
 			continue;
 
 		if (efi_soft_reserve_enabled() &&
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 415df0d3bc81..b45458af00ca 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -3,12 +3,15 @@
 #include "../cpuflags.h"
 #include "bitmap.h"
 #include "error.h"
+#include "find.h"
 #include "math.h"
 
 #define PMD_SHIFT	21
 #define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
 #define PMD_MASK	(~(PMD_SIZE - 1))
 
+extern struct boot_params *boot_params;
+
 static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
@@ -66,3 +69,18 @@ void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
 	bitmap_set((unsigned long *)params->unaccepted_memory,
 		   start / PMD_SIZE, (end - start) / PMD_SIZE);
 }
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long range_start, range_end;
+	unsigned long *bitmap, bitmap_size;
+
+	bitmap = (unsigned long *)boot_params->unaccepted_memory;
+	range_start = start / PMD_SIZE;
+	bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
+
+	for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
+		__accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
+		bitmap_clear(bitmap, range_start, range_end - range_start);
+	}
+}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index cf690d8712f4..c41a87b0ec06 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -454,6 +454,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
+
+	if (boot_params->unaccepted_memory) {
+		debug_putstr("Accepting memory... ");
+		accept_memory(__pa(output), __pa(output) + needed_size);
+	}
+
 	__decompress(input_data, input_len, NULL, NULL, output, output_len,
 			NULL, error);
 	parse_elf(output);
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 245cf8f2a0bd..da7ef45133b9 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -235,4 +235,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
 }
 #endif /* CONFIG_EFI */
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+void accept_memory(phys_addr_t start, phys_addr_t end);
+#else
+static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
+#endif
+
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index df0736d32858..41fbfc798100 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,4 +7,6 @@ struct boot_params;
 
 void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
 
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 07/14] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-07-26  9:07   ` Borislav Petkov
  2022-06-14 12:02 ` [PATCHv7 09/14] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Mike Rapoport

A given page of memory can only be accepted once. The kernel has to
accept memory both in the early decompression stage and during normal
runtime.

A bitmap is used to communicate the acceptance state of each page
between the decompression stage and normal runtime.

boot_params is used to communicate the location of the bitmap
throughout the boot. The bitmap is allocated and initially populated in
the EFI stub. The decompression stage accepts pages required for the
kernel/initrd and marks these pages accordingly in the bitmap. The main
kernel picks up the bitmap from the same boot_params and uses it to
determine what has to be accepted on allocation.

In the runtime kernel, reserve the bitmap's memory to ensure nothing
overwrites it.

The size of the bitmap is determined with e820__end_of_ram_pfn(), which
relies on setup_e820() marking unaccepted memory as E820_TYPE_RAM.
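
As a worked example (illustrative numbers): with 64GiB of RAM,
e820__end_of_ram_pfn() * PAGE_SIZE is 64GiB, so

	size = DIV_ROUND_UP(64GiB, 2MiB * 8) = 4KiB

i.e. a single 4k page gets reserved for the bitmap.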

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/kernel/e820.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index f267205f2d5a..22d1fe48dcba 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
 	int i;
 	u64 end;
 
+	/* Mark unaccepted memory bitmap reserved */
+	if (boot_params.unaccepted_memory) {
+		unsigned long size;
+
+		/* One bit per 2MB */
+		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
+				    PMD_SIZE * BITS_PER_BYTE);
+		memblock_reserve(boot_params.unaccepted_memory, size);
+	}
+
 	/*
 	 * The bootstrap memblock region count maximum is 128 entries
 	 * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 09/14] x86/mm: Provide helpers for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-14 12:02 ` [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed.

 - range_contains_unaccepted_memory() checks if anything within the
   range requires acceptance.
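
The expected caller pattern in core-mm is roughly (sketch, not the
actual mm/ code):

	/* Before handing a page range to a consumer for the first time: */
	if (range_contains_unaccepted_memory(start, end))
		accept_memory(start, end);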

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page.h              |  3 ++
 arch/x86/include/asm/unaccepted_memory.h |  4 ++
 arch/x86/mm/Makefile                     |  2 +
 arch/x86/mm/unaccepted_memory.c          | 61 ++++++++++++++++++++++++
 4 files changed, 70 insertions(+)
 create mode 100644 arch/x86/mm/unaccepted_memory.c

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 9cc82f305f4b..df4ec3a988dc 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,9 @@
 struct page;
 
 #include <linux/range.h>
+
+#include <asm/unaccepted_memory.h>
+
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index 41fbfc798100..89fc91c61560 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,6 +7,10 @@ struct boot_params;
 
 void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
 void accept_memory(phys_addr_t start, phys_addr_t end);
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
 
 #endif
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f8220fd2c169..96fc2766b6d7 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -59,3 +59,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
+obj-$(CONFIG_UNACCEPTED_MEMORY)	+= unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
new file mode 100644
index 000000000000..1df918b21469
--- /dev/null
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/spinlock.h>
+
+#include <asm/io.h>
+#include <asm/setup.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long range_start, range_end;
+	unsigned long *bitmap;
+	unsigned long flags;
+
+	if (!boot_params.unaccepted_memory)
+		return;
+
+	bitmap = __va(boot_params.unaccepted_memory);
+	range_start = start / PMD_SIZE;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for_each_set_bitrange_from(range_start, range_end, bitmap,
+				   DIV_ROUND_UP(end, PMD_SIZE)) {
+		unsigned long len = range_end - range_start;
+
+		/* Platform-specific memory-acceptance call goes here */
+		panic("Cannot accept memory: unknown platform\n");
+		bitmap_clear(bitmap, range_start, len);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *bitmap;
+	unsigned long flags;
+	bool ret = false;
+
+	if (!boot_params.unaccepted_memory)
+		return 0;
+
+	bitmap = __va(boot_params.unaccepted_memory);
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	while (start < end) {
+		if (test_bit(start / PMD_SIZE, bitmap)) {
+			ret = true;
+			break;
+		}
+
+		start += PMD_SIZE;
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return ret;
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 09/14] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-23 17:19   ` Dave Hansen
                     ` (3 more replies)
  2022-06-14 12:02 ` [PATCHv7 11/14] x86: Disable kexec if system has " Kirill A. Shutemov
                   ` (9 subsequent siblings)
  19 siblings, 4 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.

But, this approach does not work for unaccepted memory. For TDX, a load
from unaccepted memory will not lead to a recoverable exception within
the guest. The guest will exit to the VMM where the only recourse is to
terminate the guest.

There are three parts to fix this issue and comprehensively avoid access
to unaccepted memory. Together these ensure that an extra “guard” page
is accepted in addition to the memory that needs to be used.

1. Implicitly extend the range_contains_unaccepted_memory(start, end)
   checks up to end+2M if ‘end’ is aligned on a 2M boundary.
2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
   aligned on a 2M boundary.
3. Set PageUnaccepted() on both memory that itself needs to be accepted
   *and* memory where the next page needs to be accepted. Essentially,
   make PageUnaccepted(page) a marker for whether work needs to be done
   to make ‘page’ usable. That work might include accepting pages in
   addition to ‘page’ itself.
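
As a concrete example of 1 and 2 (illustrative addresses): a check on
[0, 2M) gets extended to [0, 4M), because load_unaligned_zeropad() of
the word at 2M-1 reads bytes up to 2M+6 -- that is, into the next 2M
chunk.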

Side note: This leads to something strange. Pages which were accepted
	   at boot, marked by the firmware as accepted and will never
	   _need_ to be accepted might have PageUnaccepted() set on
	   them. PageUnaccepted(page) is a cue to ensure that the next
	   page is accepted before ‘page’ can be used.

This is an actual, real-world problem which was discovered during TDX
testing.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/unaccepted_memory.c         | 36 +++++++++++++++++++++++++
 drivers/firmware/efi/libstub/x86-stub.c |  7 +++++
 2 files changed, 43 insertions(+)

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 1df918b21469..bcd56fe82b9e 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	bitmap = __va(boot_params.unaccepted_memory);
 	range_start = start / PMD_SIZE;
 
+	/*
+	 * load_unaligned_zeropad() can lead to unwanted loads across page
+	 * boundaries. The unwanted loads are typically harmless. But, they
+	 * might be made to totally unrelated or even unmapped memory.
+	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+	 * #VE) to recover from these unwanted loads.
+	 *
+	 * But, this approach does not work for unaccepted memory. For TDX, a
+	 * load from unaccepted memory will not lead to a recoverable exception
+	 * within the guest. The guest will exit to the VMM where the only
+	 * recourse is to terminate the guest.
+	 *
+	 * There are three parts to fix this issue and comprehensively avoid
+	 * access to unaccepted memory. Together these ensure that an extra
+	 * “guard” page is accepted in addition to the memory that needs to be
+	 * used:
+	 *
+	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
+	 *
+	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
+	 *    aligned on a 2M boundary.
+	 *
+	 * 3. Set PageUnaccepted() on both memory that itself needs to be
+	 *    accepted *and* memory where the next page needs to be accepted.
+	 *    Essentially, make PageUnaccepted(page) a marker for whether work
+	 *    needs to be done to make ‘page’ usable. That work might include
+	 *    accepting pages in addition to ‘page’ itself.
+	 */
+	if (!(end % PMD_SIZE))
+		end += PMD_SIZE;
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	for_each_set_bitrange_from(range_start, range_end, bitmap,
 				   DIV_ROUND_UP(end, PMD_SIZE)) {
@@ -46,6 +78,10 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
 	bitmap = __va(boot_params.unaccepted_memory);
 
+	/* See comment on load_unaligned_zeropad() in accept_memory() */
+	if (!(end % PMD_SIZE))
+		end += PMD_SIZE;
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	while (start < end) {
 		if (test_bit(start / PMD_SIZE, bitmap)) {
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index b91c89100b2d..bc1110509de4 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -709,6 +709,13 @@ static efi_status_t allocate_unaccepted_memory(struct boot_params *params,
 		return EFI_SUCCESS;
 	}
 
+	/*
+	 * range_contains_unaccepted_memory() may need to check one 2M chunk
+	 * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
+	 * sure that the bitmap is large enough to handle it.
+	 */
+	max_addr += PMD_SIZE;
+
 	/*
 	 * If unaccepted memory is present allocate a bitmap to track what
 	 * memory has to be accepted before access.
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-23 17:23   ` Dave Hansen
  2022-06-14 12:02 ` [PATCHv7 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

On kexec, the target kernel has to know what memory has been accepted.
Information in the EFI memory map is out of date and cannot be used.

boot_params.unaccepted_memory can be used to pass the bitmap between two
kernels on kexec, but the use-case is not yet implemented.

Disable kexec on machines with unaccepted memory for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/unaccepted_memory.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index bcd56fe82b9e..05e216716690 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/kexec.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/pfn.h>
@@ -95,3 +96,21 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
 	return ret;
 }
+
+static int __init unaccepted_init(void)
+{
+	if (!boot_params.unaccepted_memory)
+		return 0;
+
+#ifdef CONFIG_KEXEC_CORE
+	/*
+	 * TODO: Information on memory acceptance status has to be communicated
+	 * between kernels.
+	 */
+	pr_warn("Disable kexec: not yet supported on systems with unaccepted memory\n");
+	kexec_load_disabled = 1;
+#endif
+
+	return 0;
+}
+fs_initcall(unaccepted_init);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 11/14] x86: Disable kexec if system has " Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-23 17:25   ` Dave Hansen
  2022-06-14 12:02 ` [PATCHv7 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Memory acceptance requires a hypercall and one or multiple module calls.

Make helpers for the calls available in the boot stub. The stub has to
accept the memory where the kernel image and initrd are placed.
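
For example, the MAP_GPA notification the stub needs boils down to
(sketch; the real call site is added in a later patch):

	/* Ask the VMM to map [start, end) as private */
	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
		error("Accepting memory failed\n");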

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c           | 26 ------------------
 arch/x86/include/asm/shared/tdx.h | 45 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/tdx.h        | 19 -------------
 3 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 3bcaf2170ede..b75fe670032b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -12,14 +12,6 @@
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
 
-/* TDX module Call Leaf IDs */
-#define TDX_GET_INFO			1
-#define TDX_GET_VEINFO			3
-#define TDX_ACCEPT_PAGE			6
-
-/* TDX hypercall Leaf IDs */
-#define TDVMCALL_MAP_GPA		0x10001
-
 /* MMIO direction */
 #define EPT_READ	0
 #define EPT_WRITE	1
@@ -34,24 +26,6 @@
 #define VE_GET_PORT_NUM(e)	((e) >> 16)
 #define VE_IS_IO_STRING(e)	((e) & BIT(4))
 
-/*
- * Wrapper for standard use of __tdx_hypercall with no output aside from
- * return code.
- */
-static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
-{
-	struct tdx_hypercall_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = fn,
-		.r12 = r12,
-		.r13 = r13,
-		.r14 = r14,
-		.r15 = r15,
-	};
-
-	return __tdx_hypercall(&args, 0);
-}
-
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void)
 {
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index e53f26228fbb..956ced04c3be 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -13,6 +13,14 @@
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO			1
+#define TDX_GET_VEINFO			3
+#define TDX_ACCEPT_PAGE			6
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA		0x10001
+
 #ifndef __ASSEMBLY__
 
 /*
@@ -33,8 +41,45 @@ struct tdx_hypercall_args {
 /* Used to request services from the VMM */
 u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
 
+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = fn,
+		.r12 = r12,
+		.r13 = r13,
+		.r14 = r14,
+		.r15 = r15,
+	};
+
+	return __tdx_hypercall(&args, 0);
+}
+
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void);
 
+/*
+ * Used in __tdx_module_call() to gather the output registers' values of the
+ * TDCALL instruction when requesting services from the TDX module. This is a
+ * software only structure and not part of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..d9106d3e89f8 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,21 +20,6 @@
 
 #ifndef __ASSEMBLY__
 
-/*
- * Used to gather the output registers values of the TDCALL and SEAMCALL
- * instructions when requesting services from the TDX module.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_module_output {
-	u64 rcx;
-	u64 rdx;
-	u64 r8;
-	u64 r9;
-	u64 r10;
-	u64 r11;
-};
-
 /*
  * Used by the #VE exception handler to gather the #VE exception
  * info from the TDX module. This is a software only structure
@@ -55,10 +40,6 @@ struct ve_info {
 
 void __init tdx_early_init(void);
 
-/* Used to communicate with the TDX module */
-u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-		      struct tdx_module_output *out);
-
 void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 13/14] x86/tdx: Refactor try_accept_one()
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-23 17:31   ` Dave Hansen
  2022-07-26 10:58   ` Borislav Petkov
  2022-06-14 12:02 ` [PATCHv7 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
                   ` (6 subsequent siblings)
  19 siblings, 2 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Rework try_accept_one() to return the accepted size instead of
modifying 'start' inside the helper. It makes 'start' an in-only
argument and streamlines code on the caller side.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
---
 arch/x86/coco/tdx/tdx.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index b75fe670032b..b10c95307f6e 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -624,18 +624,18 @@ static bool tdx_cache_flush_required(void)
 	return true;
 }
 
-static bool try_accept_one(phys_addr_t *start, unsigned long len,
-			  enum pg_level pg_level)
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+				    enum pg_level pg_level)
 {
 	unsigned long accept_size = page_level_size(pg_level);
 	u64 tdcall_rcx;
 	u8 page_size;
 
-	if (!IS_ALIGNED(*start, accept_size))
-		return false;
+	if (!IS_ALIGNED(start, accept_size))
+		return 0;
 
 	if (len < accept_size)
-		return false;
+		return 0;
 
 	/*
 	 * Pass the page physical address to the TDX module to accept the
@@ -654,15 +654,14 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
 		page_size = 2;
 		break;
 	default:
-		return false;
+		return 0;
 	}
 
-	tdcall_rcx = *start | page_size;
+	tdcall_rcx = start | page_size;
 	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
-		return false;
+		return 0;
 
-	*start += accept_size;
-	return true;
+	return accept_size;
 }
 
 /*
@@ -699,21 +698,22 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	 */
 	while (start < end) {
 		unsigned long len = end - start;
+		unsigned long accept_size;
 
 		/*
 		 * Try larger accepts first. It gives chance to VMM to keep
-		 * 1G/2M SEPT entries where possible and speeds up process by
-		 * cutting number of hypercalls (if successful).
+		 * 1G/2M Secure EPT entries where possible and speeds up
+		 * process by cutting number of hypercalls (if successful).
 		 */
 
-		if (try_accept_one(&start, len, PG_LEVEL_1G))
-			continue;
-
-		if (try_accept_one(&start, len, PG_LEVEL_2M))
-			continue;
-
-		if (!try_accept_one(&start, len, PG_LEVEL_4K))
+		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+		if (!accept_size)
 			return false;
+		start += accept_size;
 	}
 
 	return true;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCHv7 14/14] x86/tdx: Add unaccepted memory support
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
@ 2022-06-14 12:02 ` Kirill A. Shutemov
  2022-06-24 16:22   ` Dave Hansen
  2022-07-26 14:51   ` Borislav Petkov
  2022-06-24 16:37 ` [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Peter Gonda
                   ` (5 subsequent siblings)
  19 siblings, 2 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-14 12:02 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Hook up TDX-specific code to accept memory.

Accepting the memory is the same process as converting memory from
shared to private: the kernel notifies the VMM with the MAP_GPA
hypercall and then accepts pages with the ACCEPT_PAGE module call.

The implementation in the core kernel uses tdx_enc_status_changed(). It
is already used for converting memory to shared and back for I/O
transactions.

The boot stub provides its own implementation of tdx_accept_memory().
It is similar in structure to tdx_enc_status_changed(), but only cares
about converting memory to private.
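
In outline, accepting [start, end) becomes (illustrative pseudo-C of the
flow implemented below; the 1G/2M/4K fallback is elided):

	/* Notify the VMM, then accept page-by-page at the largest size */
	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
		error("Accepting memory failed\n");
	while (start < end) {
		accept_size = try_accept_one(start, end - start, level);
		if (!accept_size)
			error("Accepting memory failed\n");
		start += accept_size;
	}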

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                  |  1 +
 arch/x86/boot/compressed/mem.c    | 27 ++++++++++-
 arch/x86/boot/compressed/tdx.c    | 78 +++++++++++++++++++++++++++++++
 arch/x86/coco/tdx/tdx.c           | 30 ++++++++----
 arch/x86/include/asm/shared/tdx.h |  2 +
 arch/x86/mm/unaccepted_memory.c   |  9 +++-
 6 files changed, 136 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9783ebc4e021..80683afa5749 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -892,6 +892,7 @@ config INTEL_TDX_GUEST
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
 	select X86_MCE
+	select UNACCEPTED_MEMORY
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index b45458af00ca..48e36e640da1 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -5,6 +5,8 @@
 #include "error.h"
 #include "find.h"
 #include "math.h"
+#include "tdx.h"
+#include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
 #define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
@@ -12,10 +14,33 @@
 
 extern struct boot_params *boot_params;
 
+static bool is_tdx_guest(void)
+{
+	static bool once;
+	static bool is_tdx;
+
+	if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
+		return false;
+
+	if (!once) {
+		u32 eax, sig[3];
+
+		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
+			    &sig[0], &sig[2],  &sig[1]);
+		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
+		once = true;
+	}
+
+	return is_tdx;
+}
+
 static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
-	error("Cannot accept memory");
+	if (is_tdx_guest())
+		tdx_accept_memory(start, end);
+	else
+		error("Cannot accept memory: unknown platform\n");
 }
 
 /*
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 918a7606f53c..8518a75e5dd5 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -3,12 +3,15 @@
 #include "../cpuflags.h"
 #include "../string.h"
 #include "../io.h"
+#include "align.h"
 #include "error.h"
+#include "pgtable_types.h"
 
 #include <vdso/limits.h>
 #include <uapi/asm/vmx.h>
 
 #include <asm/shared/tdx.h>
+#include <asm/page_types.h>
 
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void)
@@ -75,3 +78,78 @@ void early_tdx_detect(void)
 	pio_ops.f_outb = tdx_outb;
 	pio_ops.f_outw = tdx_outw;
 }
+
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+				    enum pg_level level)
+{
+	unsigned long accept_size = PAGE_SIZE << ((level - 1) * PTE_SHIFT);
+	u64 tdcall_rcx;
+	u8 page_size;
+
+	if (!IS_ALIGNED(start, accept_size))
+		return 0;
+
+	if (len < accept_size)
+		return 0;
+
+	/*
+	 * Pass the page physical address to the TDX module to accept the
+	 * pending, private page.
+	 *
+	 * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+	 */
+	switch (level) {
+	case PG_LEVEL_4K:
+		page_size = 0;
+		break;
+	case PG_LEVEL_2M:
+		page_size = 1;
+		break;
+	case PG_LEVEL_1G:
+		page_size = 2;
+		break;
+	default:
+		return 0;
+	}
+
+	tdcall_rcx = start | page_size;
+	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
+		return 0;
+
+	return accept_size;
+}
+
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/*
+	 * Notify the VMM about page mapping conversion. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
+	 * section "TDG.VP.VMCALL<MapGPA>"
+	 */
+	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+		error("Accepting memory failed\n");
+
+	/*
+	 * For shared->private conversion, accept the page using
+	 * TDX_ACCEPT_PAGE TDX module call.
+	 */
+	while (start < end) {
+		unsigned long len = end - start;
+		unsigned long accept_size;
+
+		/*
+		 * Try larger accepts first. It gives chance to VMM to keep
+		 * 1G/2M Secure EPT entries where possible and speeds up
+		 * process by cutting number of hypercalls (if successful).
+		 */
+
+		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+		if (!accept_size)
+			error("Accepting memory failed\n");
+		start += accept_size;
+	}
+}
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index b10c95307f6e..8240f04d3646 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -664,16 +664,9 @@ static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
 	return accept_size;
 }
 
-/*
- * Inform the VMM of the guest's intent for this physical page: shared with
- * the VMM or private to the guest.  The VMM is expected to change its mapping
- * of the page in response.
- */
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
+					bool enc)
 {
-	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
-
 	if (!enc) {
 		/* Set the shared (decrypted) bits: */
 		start |= cc_mkdec(0);
@@ -719,6 +712,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	return true;
 }
 
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	if (!tdx_enc_status_changed_phys(start, end, true))
+		panic("Accepting memory failed: %#llx-%#llx\n",  start, end);
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest.  The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+	phys_addr_t start = __pa(vaddr);
+	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+	return tdx_enc_status_changed_phys(start, end, enc);
+}
+
 void __init tdx_early_init(void)
 {
 	u64 cc_mask;
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 956ced04c3be..97534c334473 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -81,5 +81,7 @@ struct tdx_module_output {
 u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		      struct tdx_module_output *out);
 
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 05e216716690..9ec2304272dc 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -7,6 +7,7 @@
 
 #include <asm/io.h>
 #include <asm/setup.h>
+#include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
 
 /* Protects unaccepted memory bitmap */
@@ -62,7 +63,13 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		unsigned long len = range_end - range_start;
 
 		/* Platform-specific memory-acceptance call goes here */
-		panic("Cannot accept memory: unknown platform\n");
+		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+			tdx_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
+		} else {
+			panic("Cannot accept memory: unknown platform\n");
+		}
+
 		bitmap_clear(bitmap, range_start, len);
 	}
 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
@ 2022-06-14 12:57   ` Gupta, Pankaj
  2022-06-17 19:28   ` Tom Lendacky
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ messages in thread
From: Gupta, Pankaj @ 2022-06-14 12:57 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 6/14/2022 2:02 PM, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways the kernel can deal with unaccepted memory:
> 
>   1. Accept all the memory during the boot. It is easy to implement and
>      it doesn't have runtime cost once the system is booted. The downside
>      is very long boot time.
> 
>      Accept can be parallelized to multiple CPUs to keep it manageable
>      (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>      memory bandwidth and does not scale beyond that point.
> 
>   2. Accept a block of memory on the first use. It requires more
>      infrastructure and changes in page allocator to make it work, but
>      it provides good boot time.
> 
>      On-demand memory accept means latency spikes every time the
>      kernel steps onto a new memory block. The spikes will go away once
>      the workload's data set size stabilizes or all memory gets
>      accepted.
> 
>   3. Accept all memory in background. Introduce a thread (or multiple)
>      that gets memory accepted proactively. It will minimize the time
>      the system experiences latency spikes on memory allocation while
>      keeping boot time low.
> 
>      This approach cannot function on its own. It is an extension of #2:
>      background memory acceptance requires functional scheduler, but the
>      page allocator may need to tap into unaccepted memory before that.
> 
>      The downside of the approach is that these threads also steal CPU
>      cycles and memory bandwidth from the user's workload and may hurt
>      user experience.
> 
> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3; they can be implemented later based on user
> demand.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>    - memblock has to accept memory on allocation;
> 
>    - page allocator has to accept memory on the first allocation of the
>      page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages on the first allocation.
> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
> used to indicate that the page requires acceptance.
> 
> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>   - accept_memory() makes a range of physical addresses accepted.
> 
>   - range_contains_unaccepted_memory() checks if anything within the
>     range of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/page-flags.h | 31 +++++++++++++
>   mm/internal.h              | 12 +++++
>   mm/memblock.c              |  9 ++++
>   mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
>   4 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e66f7aa3191d..444ba8f4bfa0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page)
>   #define PG_offline	0x00000100
>   #define PG_table	0x00000200
>   #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
> +
> +/*
> + * Page types allowed at page_expected_state()
> + *
> + * PageUnaccepted() will get cleared in post_alloc_hook().
> + */
> +#define PAGE_TYPES_EXPECTED	PG_unaccepted
>   
>   #define PageType(page, flag)						\
>   	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>   	page->page_type |= PG_##lname;					\
>   }
>   
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}
> +
>   /*
>    * PageBuddy() indicates that the page is free and in the buddy system
>    * (see mm/page_alloc.c).
> @@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>    */
>   PAGE_TYPE_OPS(Offline, offline)
>   
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>   extern void page_offline_freeze(void);
>   extern void page_offline_thaw(void);
>   extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index c0f8fbe0445b..7c1a653edfc8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>   
>   DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>   
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>   #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index e4f03a6e8e56..a1f7f8b304d5 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>   		 */
>   		kmemleak_alloc_phys(found, size, 0, 0);
>   
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>   	return found;
>   }
>   
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e008a3df0485..279c2746aaa8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -122,6 +122,12 @@ typedef int __bitwise fpi_t;
>    */
>   #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
>   
> +/*
> + * Check if the page needs to be marked as PageUnaccepted().
> + * Used for the new pages added to the buddy allocator for the first time.
> + */
> +#define FPI_UNACCEPTED_SLOWPATH	((__force fpi_t)BIT(3))
> +
>   /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>   static DEFINE_MUTEX(pcp_batch_high_lock);
>   #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>   			NULL) != NULL;
>   }
>   
> +/*
> + * Page acceptance can be very slow. Do not call under critical locks.
> + */
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	int i;
> +
> +	if (!PageUnaccepted(page))
> +		return;
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +	__ClearPageUnaccepted(page);
> +
> +	/* Assert that there is no PageUnaccepted() on tail pages */
> +	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> +		for (i = 1; i < (1 << order); i++)
> +			VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i);
> +	}
> +}
> +
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +
>   /*
>    * Freeing function for a buddy system allocator.
>    *
> @@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page,
>   	unsigned long combined_pfn;
>   	struct page *buddy;
>   	bool to_tail;
> +	bool page_needs_acceptance = false;
>   
>   	VM_BUG_ON(!zone_is_initialized(zone));
>   	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page,
>   	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>   	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>   
> +	if (PageUnaccepted(page)) {
> +		page_needs_acceptance = true;
> +		__ClearPageUnaccepted(page);
> +	}
> +
>   	while (order < MAX_ORDER - 1) {
>   		if (compaction_capture(capc, page, order, migratetype)) {
>   			__mod_zone_freepage_state(zone, -(1 << order),
> @@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page,
>   			clear_page_guard(zone, buddy, order, migratetype);
>   		else
>   			del_page_from_free_list(buddy, zone, order);
> +
> +		/* Mark page unaccepted if any of merged pages were unaccepted */
> +		if (PageUnaccepted(buddy)) {
> +			page_needs_acceptance = true;
> +			__ClearPageUnaccepted(buddy);
> +		}
> +
>   		combined_pfn = buddy_pfn & pfn;
>   		page = page + (combined_pfn - pfn);
>   		pfn = combined_pfn;
> @@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page,
>   done_merging:
>   	set_buddy_order(page, order);
>   
> +	/*
> +	 * The page gets marked as PageUnaccepted() if any of merged-in pages
> +	 * is PageUnaccepted().
> +	 *
> +	 * New pages, just being added to buddy allocator, do not have
> +	 * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the
> +	 * page is new and the page_contains_unaccepted() check is required
> +	 * to determine whether acceptance is required.
> +	 *
> +	 * Avoid calling page_contains_unaccepted() if it is known that the page
> +	 * needs acceptance. It can be costly.
> +	 */
> +	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
> +		page_needs_acceptance = page_contains_unaccepted(page, order);
> +	if (page_needs_acceptance)
> +		__SetPageUnaccepted(page);
> +
>   	if (fpi_flags & FPI_TO_TAIL)
>   		to_tail = true;
>   	else if (is_shuffle_order(order))
> @@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page,
>   static inline bool page_expected_state(struct page *page,
>   					unsigned long check_flags)
>   {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	/*
> +	 * The page must not be mapped to userspace and must not have
> +	 * a PageType other than listed in PAGE_TYPES_EXPECTED.
> +	 *
> +	 * Note, bit cleared means the page type is set.
> +	 */
> +	if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1))
>   		return false;
>   
>   	if (unlikely((unsigned long)page->mapping |
> @@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order)
>   	 * Bypass PCP and place fresh pages right to the tail, primarily
>   	 * relevant for memory onlining.
>   	 */
> -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +	__free_pages_ok(page, order,
> +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON |
> +			FPI_UNACCEPTED_SLOWPATH);
>   }
>   
>   #ifdef CONFIG_NUMA
> @@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn,
>   		return;
>   	}
>   
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>   	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>   		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>   			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page,
>   		if (set_page_guard(zone, &page[size], high, migratetype))
>   			continue;
>   
> +		/*
> +		 * Transfer PageUnaccepted() to the newly split pages so
> +		 * they can be accepted after dropping the zone lock.
> +		 */
> +		if (PageUnaccepted(page))
> +			__SetPageUnaccepted(&page[size]);
> +
>   		add_to_free_list(&page[size], zone, high, migratetype);
>   		set_buddy_order(&page[size], high);
>   	}
> @@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>   	 */
>   	kernel_unpoison_pages(page, 1 << order);
>   
> +	accept_page(page, order);
> +
>   	/*
>   	 * As memory initialization might be integrated into KASAN,
>   	 * KASAN unpoisoning and memory initialization code must be

I reviewed the previous version (v6) of this patch. There seems to be
no change in this version except the addition of the '!PageUnaccepted'
check in accept_page() and the rename of page_contains_unaccepted().
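
A side note for anyone staring at the page_expected_state() change:
page types are encoded as *cleared* bits in the word shared with
_mapcount, so a typeless, unmapped page reads as -1. ORing
PAGE_TYPES_EXPECTED back in before the -1 comparison is what lets
PageUnaccepted() pages pass while any other page type still fails.
A userspace toy to convince yourself (PG_unaccepted as in the hunk,
PG_buddy from page-flags.h; the harness itself is illustration only):

	#include <stdio.h>

	#define PG_buddy		0x00000080u
	#define PG_unaccepted		0x00000800u
	#define PAGE_TYPES_EXPECTED	PG_unaccepted

	static int expected_state(unsigned int mapcount)
	{
		return (mapcount | PAGE_TYPES_EXPECTED) == (unsigned int)-1;
	}

	int main(void)
	{
		printf("no type:    %d\n", expected_state(-1u));            /* 1 */
		printf("unaccepted: %d\n", expected_state(~PG_unaccepted)); /* 1 */
		printf("buddy:      %d\n", expected_state(~PG_buddy));      /* 0 */
		return 0;
	}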

Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support
  2022-06-14 12:02 ` [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support Kirill A. Shutemov
@ 2022-06-15 10:19   ` Peter Zijlstra
  2022-06-15 15:05     ` Kirill A. Shutemov
  2022-07-25 21:33   ` Borislav Petkov
  1 sibling, 1 reply; 200+ messages in thread
From: Peter Zijlstra @ 2022-06-15 10:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Paolo Bonzini, Ingo Molnar,
	Varad Gautam, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:22PM +0300, Kirill A. Shutemov wrote:
> Pull functionality from the main kernel headers and lib/ that is
> required for unaccepted memory support.
> 
> This is a preparatory patch. The users of the functionality will come
> in the following patches.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/boot/bitops.h                   | 40 ++++++++++++
>  arch/x86/boot/compressed/align.h         | 14 +++++
>  arch/x86/boot/compressed/bitmap.c        | 43 +++++++++++++
>  arch/x86/boot/compressed/bitmap.h        | 49 +++++++++++++++
>  arch/x86/boot/compressed/bits.h          | 36 +++++++++++
>  arch/x86/boot/compressed/compiler.h      |  9 +++
>  arch/x86/boot/compressed/find.c          | 54 ++++++++++++++++
>  arch/x86/boot/compressed/find.h          | 80 ++++++++++++++++++++++++
>  arch/x86/boot/compressed/math.h          | 37 +++++++++++
>  arch/x86/boot/compressed/minmax.h        | 61 ++++++++++++++++++
>  arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++

That's quite a lot of duplicated code; is there really no way to share
this?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support
  2022-06-15 10:19   ` Peter Zijlstra
@ 2022-06-15 15:05     ` Kirill A. Shutemov
  2022-07-17 17:16       ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-15 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Wed, Jun 15, 2022 at 12:19:45PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2022 at 03:02:22PM +0300, Kirill A. Shutemov wrote:
> > Pull functionality from the main kernel headers and lib/ that is
> > required for unaccepted memory support.
> > 
> > This is a preparatory patch. The users of the functionality will come
> > in the following patches.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/boot/bitops.h                   | 40 ++++++++++++
> >  arch/x86/boot/compressed/align.h         | 14 +++++
> >  arch/x86/boot/compressed/bitmap.c        | 43 +++++++++++++
> >  arch/x86/boot/compressed/bitmap.h        | 49 +++++++++++++++
> >  arch/x86/boot/compressed/bits.h          | 36 +++++++++++
> >  arch/x86/boot/compressed/compiler.h      |  9 +++
> >  arch/x86/boot/compressed/find.c          | 54 ++++++++++++++++
> >  arch/x86/boot/compressed/find.h          | 80 ++++++++++++++++++++++++
> >  arch/x86/boot/compressed/math.h          | 37 +++++++++++
> >  arch/x86/boot/compressed/minmax.h        | 61 ++++++++++++++++++
> >  arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++
> 
> That's quite a lot of duplicated code; is there really no way to share
> this?

Code duplication also makes me uncomfortable. But that is what Borislav
wanted to see. efi.h in the boot stub, which copies the bulk of
<linux/efi.h>, also set the trend in this direction.

The alternative is creating a subset of headers that can be used both in
the main kernel and in the boot stub. It is more complex and doesn't
allow for the shortcuts a copy can take when you know the context it is
used in.

It also sounds painfully similar to the uapi/ project. I'm not sure we
want to go down this path.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
  2022-06-14 12:57   ` Gupta, Pankaj
@ 2022-06-17 19:28   ` Tom Lendacky
  2022-06-17 20:53     ` Tom Lendacky
  2022-07-21 15:14   ` Borislav Petkov
  2022-08-05 11:49   ` Vlastimil Babka
  3 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-06-17 19:28 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport, Michael Roth,
	Ashish Kalra

On 6/14/22 07:02, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways kernel can deal with unaccepted memory:
> 
>   1. Accept all the memory during the boot. It is easy to implement and
>      it doesn't have runtime cost once the system is booted. The downside
>      is very long boot time.
> 
>      Accept can be parallelized to multiple CPUs to keep it manageable
>      (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>      memory bandwidth and does not scale beyond the point.
> 
>   2. Accept a block of memory on the first use. It requires more
>      infrastructure and changes in page allocator to make it work, but
>      it provides good boot time.
> 
>      On-demand memory acceptance means latency spikes every time the
>      kernel steps onto a new memory block. The spikes will go away once
>      the workload data set size stabilizes or all memory gets accepted.
> 
>   3. Accept all memory in background. Introduce a thread (or multiple)
>      that gets memory accepted proactively. It will minimize the time
>      the system experiences latency spikes on memory allocation while
>      keeping boot time low.
> 
>      This approach cannot function on its own. It is an extension of #2:
>      background memory acceptance requires a functional scheduler, but
>      the page allocator may need to tap into unaccepted memory before
>      that.
> 
>      The downside of the approach is that these threads also steal CPU
>      cycles and memory bandwidth from the user's workload and may hurt
>      user experience.
> 
> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3; they can be implemented later based on user
> demand.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>    - memblock has to accept memory on allocation;
> 
>    - page allocator has to accept memory on the first allocation of the
>      page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages on the first allocation.
> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
> used to indicate that the page requires acceptance.
> 
> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>   - accept_memory() makes a range of physical addresses accepted.
> 
>   - range_contains_unaccepted_memory() checks whether anything within
>     the range of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/page-flags.h | 31 +++++++++++++
>   mm/internal.h              | 12 +++++
>   mm/memblock.c              |  9 ++++
>   mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
>   4 files changed, 139 insertions(+), 2 deletions(-)
> 

> diff --git a/mm/memblock.c b/mm/memblock.c
> index e4f03a6e8e56..a1f7f8b304d5 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>   		 */
>   		kmemleak_alloc_phys(found, size, 0, 0);
>   
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);

The SNP support will kmalloc a descriptor that can be used to supply the 
range to the hypervisor using the GHCB/VMGEXIT. But kmalloc won't work 
when called this early, so we are likely to need an early_accept_memory or 
some kind of flag to know whether this is an early call or not in order to 
use a static descriptor in the file.
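
Roughly along these lines, maybe (completely untested sketch; the
helper names are made up, and it ignores locking for the static buffer
since early boot is single-threaded):

	static struct snp_psc_desc early_psc_desc;

	static struct snp_psc_desc *get_psc_desc(void)
	{
		if (slab_is_available())
			return kzalloc(sizeof(struct snp_psc_desc), GFP_KERNEL);

		/* Too early for kmalloc: fall back to the static buffer */
		return &early_psc_desc;
	}

	static void put_psc_desc(struct snp_psc_desc *desc)
	{
		if (desc != &early_psc_desc)
			kfree(desc);
	}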

Thanks,
Tom

> +
>   	return found;
>   }
>   

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-06-17 19:28   ` Tom Lendacky
@ 2022-06-17 20:53     ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-06-17 20:53 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport, Michael Roth,
	Ashish Kalra

On 6/17/22 14:28, Tom Lendacky wrote:
> On 6/14/22 07:02, Kirill A. Shutemov wrote:
>> UEFI Specification version 2.9 introduces the concept of memory
>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
>> SEV-SNP, require memory to be accepted before it can be used by the
>> guest. Accepting happens via a protocol specific to the Virtual Machine
>> platform.
>>
>> There are several ways kernel can deal with unaccepted memory:
>>
>>   1. Accept all the memory during the boot. It is easy to implement and
>>      it doesn't have runtime cost once the system is booted. The downside
>>      is very long boot time.
>>
>>      Accept can be parallelized to multiple CPUs to keep it manageable
>>      (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>>      memory bandwidth and does not scale beyond the point.
>>
>>   2. Accept a block of memory on the first use. It requires more
>>      infrastructure and changes in page allocator to make it work, but
>>      it provides good boot time.
>>
>>      On-demand memory acceptance means latency spikes every time the
>>      kernel steps onto a new memory block. The spikes will go away once
>>      the workload data set size stabilizes or all memory gets accepted.
>>
>>   3. Accept all memory in background. Introduce a thread (or multiple)
>>      that gets memory accepted proactively. It will minimize the time
>>      the system experiences latency spikes on memory allocation while
>>      keeping boot time low.
>>
>>      This approach cannot function on its own. It is an extension of #2:
>>      background memory acceptance requires a functional scheduler, but
>>      the page allocator may need to tap into unaccepted memory before
>>      that.
>>
>>      The downside of the approach is that these threads also steal CPU
>>      cycles and memory bandwidth from the user's workload and may hurt
>>      user experience.
>>
>> Implement #2 for now. It is a reasonable default. Some workloads may
>> want to use #1 or #3; they can be implemented later based on user
>> demand.
>>
>> Support of unaccepted memory requires a few changes in core-mm code:
>>
>>    - memblock has to accept memory on allocation;
>>
>>    - page allocator has to accept memory on the first allocation of the
>>      page;
>>
>> Memblock change is trivial.
>>
>> The page allocator is modified to accept pages on the first allocation.
>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
>> used to indicate that the page requires acceptance.
>>
>> Architecture has to provide two helpers if it wants to support
>> unaccepted memory:
>>
>>   - accept_memory() makes a range of physical addresses accepted.
>>
>>   - range_contains_unaccepted_memory() checks whether anything within
>>     the range of physical addresses requires acceptance.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Acked-by: Mike Rapoport <rppt@linux.ibm.com>    # memblock
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/page-flags.h | 31 +++++++++++++
>>   mm/internal.h              | 12 +++++
>>   mm/memblock.c              |  9 ++++
>>   mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
>>   4 files changed, 139 insertions(+), 2 deletions(-)
>>
> 
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index e4f03a6e8e56..a1f7f8b304d5 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1405,6 +1405,15 @@ phys_addr_t __init 
>> memblock_alloc_range_nid(phys_addr_t size,
>>            */
>>           kmemleak_alloc_phys(found, size, 0, 0);
>> +    /*
>> +     * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
>> +     * require memory to be accepted before it can be used by the
>> +     * guest.
>> +     *
>> +     * Accept the memory of the allocated buffer.
>> +     */
>> +    accept_memory(found, found + size);
> 
> The SNP support will kmalloc a descriptor that can be used to supply the 
> range to the hypervisor using the GHCB/VMGEXIT. But kmalloc won't work 
> when called this early, so we are likely to need an early_accept_memory or 
> some kind of flag to know whether this is an early call or not in order to 
> use a static descriptor in the file.

Then again, the accept_memory() call from mm/page_alloc.c can also be 
early... maybe I can check system_state and make a determination as to 
which SNP page state change mechanism to use (GHCB vs MSR protocol). It 
might not be performance optimal, though, as the MSR protocol is one 4K 
page at a time.
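
I.e. something like this (sketch only; the two helpers are stand-ins
for the real MSR-protocol and GHCB page state change paths):

	static void snp_accept_memory(phys_addr_t start, phys_addr_t end)
	{
		if (system_state == SYSTEM_BOOTING) {
			/* MSR protocol: one 4K page per VMGEXIT, slow */
			for (; start < end; start += PAGE_SIZE)
				snp_psc_msr_protocol(start);
		} else {
			/* GHCB: batch the whole range into one request */
			snp_psc_ghcb(start, end);
		}
	}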

Thanks,
Tom

> 
> Thanks,
> Tom
> 
>> +
>>       return found;
>>   }

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-06-22 19:58   ` Dave Hansen
  2022-07-26  8:35   ` Borislav Petkov
  1 sibling, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-06-22 19:58 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 6/14/22 05:02, Kirill A. Shutemov wrote:
...
> +/*
> + * The accepted memory bitmap only works at PMD_SIZE granularity. If a request
> + * comes in to mark memory as unaccepted which is not PMD_SIZE-aligned, simply
> + * accept the memory now since it can not be *marked* as unaccepted.
> + */

/*
 * The accepted memory bitmap only works at PMD_SIZE granularity.  This
 * function takes unaligned start/end addresses and either:
 *  1. Accepts the memory immediately and in its entirety
 *  2. Accepts unaligned parts, and marks *some* aligned part unaccepted
 *
 * The function will never reach the bitmap_set() with zero bits to set.
 */


> +void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
> +{
> +	/*
> +	 * Accept small regions that might not be able to be represented
> +	 * in the bitmap.  This is a bit imprecise and may accept some
> +	 * areas that could have been represented in the bitmap instead.

	/*
	 * Ensure that at least one bit will be set in the bitmap by
	 * immediately accepting all regions under 2*PMD_SIZE.  This is
	 * imprecise and may immediately accept some areas that could
	 * have been represented in the bitmap.  But, results in simpler
	 * code below.

> +	 * Consider case like this:
> +	 *
> +	 * | 4k | 2044k |    2048k   |
> +	 * ^ 0x0        ^ 2MB        ^ 4MB
> +	 *
> +	 * all memory in the range is unaccepted, except for the first 4k.
> +	 * The second 2M can be represented in the bitmap, but kernel accept it
> +	 * right away. The imprecision makes the code simpler by ensuring that
> +	 * at least one bit will be set int the bitmap below.
> +	 */

	...
	* Only the first 4k has been accepted.  The 0MB->2MB region can
	* not be represented in the bitmap.  The 2MB->4MB region can be
	* represented in the bitmap.  But, the 0MB->4MB region is
	* <2*PMD_SIZE and will be immediately accepted in its entirety.
	*/

> +	if (end - start < 2 * PMD_SIZE) {
> +		__accept_memory(start, end);
> +		return;
> +	}
> +
> +	/*
> +	 * No matter how the start and end are aligned, at least one unaccepted
> +	 * PMD_SIZE area will remain.
> +	 */

I'd probably add:

	... to be marked in the bitmap
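
FWIW, spelled out, the flow those comments describe boils down to
something like this (my reconstruction for illustration, not the
literal patch body):

	void process_unaccepted_memory(struct boot_params *params,
				       u64 start, u64 end)
	{
		u64 unaccepted_start = round_up(start, PMD_SIZE);
		u64 unaccepted_end = round_down(end, PMD_SIZE);
		unsigned long *bitmap;

		/* Too small to be represented in the bitmap: accept now */
		if (end - start < 2 * PMD_SIZE) {
			__accept_memory(start, end);
			return;
		}

		/* Unaligned head and tail are accepted immediately... */
		__accept_memory(start, unaccepted_start);
		__accept_memory(unaccepted_end, end);

		/* ...and the aligned middle (>= one 2M chunk) is marked */
		bitmap = (unsigned long *)params->unaccepted_memory;
		bitmap_set(bitmap, unaccepted_start / PMD_SIZE,
			   (unaccepted_end - unaccepted_start) / PMD_SIZE);
	}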


<snip>
> @@ -607,6 +608,17 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>  			e820_type = E820_TYPE_PMEM;
>  			break;
>  
> +		case EFI_UNACCEPTED_MEMORY:
> +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
> +				efi_warn_once("The system has unaccepted memory,"
> +					     " but kernel does not support it\n");
> +				efi_warn_once("Consider enabling CONFIG_UNACCEPTED_MEMORY\n");
> +				continue;
> +			}
> +			e820_type = E820_TYPE_RAM;
> +			process_unaccepted_memory(params, d->phys_addr,
> +						  d->phys_addr + PAGE_SIZE * d->num_pages);
> +			break;
>  		default:
>  			continue;
>  		}
> @@ -671,6 +683,59 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
>  	return status;
>  }
>  
> +static efi_status_t allocate_unaccepted_memory(struct boot_params *params,
> +					       __u32 nr_desc,
> +					       struct efi_boot_memmap *map)

I think this is misnamed.  This function is allocating a bitmap, not
"unaccepted_memory" itself.  Right?

> +{
> +	unsigned long *mem = NULL;
> +	u64 size, max_addr = 0;
> +	efi_status_t status;
> +	bool found = false;
> +	int i;
> +
> +	/* Check if there's any unaccepted memory and find the max address */
> +	for (i = 0; i < nr_desc; i++) {
> +		efi_memory_desc_t *d;
> +
> +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> +		if (d->type == EFI_UNACCEPTED_MEMORY)
> +			found = true;
> +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> +	}
> +
> +	if (!found) {
> +		params->unaccepted_memory = 0;
> +		return EFI_SUCCESS;
> +	}
> +
> +	/*
> +	 * If unaccepted memory is present allocate a bitmap to track what
			
					  ^ comma

> +	 * memory has to be accepted before access.
> +	 *
> +	 * One bit in the bitmap represents 2MiB in the address space:
> +	 * A 4k bitmap can track 64GiB of physical address space.
> +	 *
> +	 * In the worst-case scenario -- a huge hole in the middle of the
> +	 * address space -- it needs 256MiB to handle 4PiB of the address
> +	 * space.
> +	 *
> +	 * TODO: handle situation if params->unaccepted_memory is already set.
> +	 * It's required to deal with kexec.
> +	 *
> +	 * The bitmap will be populated in setup_e820() according to the memory
> +	 * map after efi_exit_boot_services().
> +	 */
> +	size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> +	status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
> +	if (status == EFI_SUCCESS) {
> +		memset(mem, 0, size);
> +		params->unaccepted_memory = (unsigned long)mem;
> +	}
> +
> +	return status;
> +}
> +
>  static efi_status_t allocate_e820(struct boot_params *params,
>  				  struct efi_boot_memmap *map,
>  				  struct setup_data **e820ext,
> @@ -691,6 +756,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
>  		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
>  	}
>  
> +	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
> +		status = allocate_unaccepted_memory(params, nr_desc, map);
> +
>  	efi_bs_call(free_pool, *map->map);
>  	return status;
>  }
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 7d9b0bb47eb3..9c2fa94f2f93 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -108,7 +108,8 @@ typedef	struct {
>  #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
>  #define EFI_PAL_CODE			13
>  #define EFI_PERSISTENT_MEMORY		14
> -#define EFI_MAX_MEMORY_TYPE		15
> +#define EFI_UNACCEPTED_MEMORY		15
> +#define EFI_MAX_MEMORY_TYPE		16
>  
>  /* Attribute values: */
>  #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
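
The sizing arithmetic in that comment checks out, for what it's worth
(userspace toy):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long pmd = 2ULL << 20;	/* 2MiB per bitmap bit */

		/* one 4k page of bitmap: 4096 * 8 bits * 2MiB = 64GiB */
		printf("%llu GiB\n", (4096 * 8 * pmd) >> 30);

		/* covering 4PiB: (4PiB / 2MiB) bits / 8 = 256MiB of bitmap */
		printf("%llu MiB\n", ((4ULL << 50) / pmd / 8) >> 20);
		return 0;
	}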


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
@ 2022-06-23 17:19   ` Dave Hansen
  2022-07-26 10:21   ` Borislav Petkov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-06-23 17:19 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 6/14/22 05:02, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
> 
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.
> 
> There are three parts to fix this issue and comprehensively avoid access
> to unaccepted memory. Together these ensure that an extra “guard” page
> is accepted in addition to the memory that needs to be used.
> 
> 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
>    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
>    aligned on a 2M boundary.
> 3. Set PageUnaccepted() on both memory that itself needs to be accepted
>    *and* memory where the next page needs to be accepted. Essentially,
>    make PageUnaccepted(page) a marker for whether work needs to be done
>    to make ‘page’ usable. That work might include accepting pages in
>    addition to ‘page’ itself.
...

That all looks pretty good.

> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> index 1df918b21469..bcd56fe82b9e 100644
> --- a/arch/x86/mm/unaccepted_memory.c
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  	bitmap = __va(boot_params.unaccepted_memory);
>  	range_start = start / PMD_SIZE;
>  
> +	/*
> +	 * load_unaligned_zeropad() can lead to unwanted loads across page
> +	 * boundaries. The unwanted loads are typically harmless. But, they
> +	 * might be made to totally unrelated or even unmapped memory.
> +	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> +	 * #VE) to recover from these unwanted loads.
> +	 *
> +	 * But, this approach does not work for unaccepted memory. For TDX, a
> +	 * load from unaccepted memory will not lead to a recoverable exception
> +	 * within the guest. The guest will exit to the VMM where the only
> +	 * recourse is to terminate the guest.
> +	 *
> +	 * There are three parts to fix this issue and comprehensively avoid
> +	 * access to unaccepted memory. Together these ensure that an extra
> +	 * “guard” page is accepted in addition to the memory that needs to be
> +	 * used:
> +	 *
> +	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> +	 *
> +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
> +	 *    aligned on a 2M boundary.
> +	 *
> +	 * 3. Set PageUnaccepted() on both memory that itself needs to be
> +	 *    accepted *and* memory where the next page needs to be accepted.
> +	 *    Essentially, make PageUnaccepted(page) a marker for whether work
> +	 *    needs to be done to make ‘page’ usable. That work might include
> +	 *    accepting pages in addition to ‘page’ itself.
> +	 */

One nit with this: I'd much rather add one sentence to these to help tie
the code implementing it with this comment.  Maybe something like:

 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
 *    aligned on a 2M boundary. (immediately following this comment)


> +	if (!(end % PMD_SIZE))
> +		end += PMD_SIZE;
> +
>  	spin_lock_irqsave(&unaccepted_memory_lock, flags);
>  	for_each_set_bitrange_from(range_start, range_end, bitmap,
>  				   DIV_ROUND_UP(end, PMD_SIZE)) {
> @@ -46,6 +78,10 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
>  
>  	bitmap = __va(boot_params.unaccepted_memory);
>  
> +	/* See comment on load_unaligned_zeropad() in accept_memory() */
> +	if (!(end % PMD_SIZE))
> +		end += PMD_SIZE;

It's a wee bit hard to follow this back to the comment that it
references, even with them sitting next to each other in this diff.  How
about adding:

	/*
	 * Also consider the unaccepted state of the *next* page.  See
	 * fix #1 in the comment on load_unaligned_zeropad() in
	 * accept_memory().
	 */
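
To make the "next page" part concrete (userspace toy, with PMD_SIZE
standing in as 2MiB):

	#include <stdio.h>

	#define PMD_SIZE (2ULL << 20)

	static unsigned long long guard_end(unsigned long long end)
	{
		/* 2M-aligned 'end': extend by one chunk for the guard */
		if (!(end % PMD_SIZE))
			end += PMD_SIZE;
		return end;
	}

	int main(void)
	{
		printf("%lluM\n", guard_end(4ULL << 20) >> 20);	/* 6M */
		printf("%lluM\n", guard_end(5ULL << 20) >> 20);	/* 5M */
		return 0;
	}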

>  	spin_lock_irqsave(&unaccepted_memory_lock, flags);
>  	while (start < end) {
>  		if (test_bit(start / PMD_SIZE, bitmap)) {
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index b91c89100b2d..bc1110509de4 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -709,6 +709,13 @@ static efi_status_t allocate_unaccepted_memory(struct boot_params *params,
>  		return EFI_SUCCESS;
>  	}
>  
> +	/*
> +	 * range_contains_unaccepted_memory() may need to check one 2M chunk
> +	 * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
> +	 * sure that the bitmap is large enough to handle it.
> +	 */
> +	max_addr += PMD_SIZE;

I guess the alternative to this would have been to record 'max_addr',
then special case 'max_addr'+2M in the bitmap checks.  I agree this is
probably nicer.

Also, the changelog needs to at least *mention* this little tidbit.  It
was a bit of a surprise when I got here.

With those fixed:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 11/14] x86: Disable kexec if system has " Kirill A. Shutemov
@ 2022-06-23 17:23   ` Dave Hansen
  2022-06-23 21:48     ` Eric W. Biederman
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-06-23 17:23 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Eric Biederman, kexec

... adding kexec folks

On 6/14/22 05:02, Kirill A. Shutemov wrote:
> On kexec, the target kernel has to know what memory has been accepted.
> Information in EFI map is out of date and cannot be used.
> 
> boot_params.unaccepted_memory can be used to pass the bitmap between two
> kernels on kexec, but the use-case is not yet implemented.
> 
> Disable kexec on machines with unaccepted memory for now.
...
> +static int __init unaccepted_init(void)
> +{
> +	if (!boot_params.unaccepted_memory)
> +		return 0;
> +
> +#ifdef CONFIG_KEXEC_CORE
> +	/*
> +	 * TODO: Information on memory acceptance status has to be communicated
> +	 * between kernel.
> +	 */
> +	pr_warn("Disable kexec: not yet supported on systems with unaccepted memory\n");
> +	kexec_load_disabled = 1;
> +#endif

This looks to be the *only* in-kernel user tweaking kexec_load_disabled.
 It doesn't feel great to just be disabling kexec like this.  Why not
just fix it properly?

What do the kexec folks think?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
  2022-06-14 12:02 ` [PATCHv7 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
@ 2022-06-23 17:25   ` Dave Hansen
  0 siblings, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-06-23 17:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 6/14/22 05:02, Kirill A. Shutemov wrote:
> Memory acceptance requires a hypercall and one or multiple module calls.
> 
> Make helpers for the calls available in boot stub. It has to accept
> memory where kernel image and initrd are placed.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/coco/tdx/tdx.c           | 26 ------------------
>  arch/x86/include/asm/shared/tdx.h | 45 +++++++++++++++++++++++++++++++
>  arch/x86/include/asm/tdx.h        | 19 -------------
>  3 files changed, 45 insertions(+), 45 deletions(-)

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 13/14] x86/tdx: Refactor try_accept_one()
  2022-06-14 12:02 ` [PATCHv7 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
@ 2022-06-23 17:31   ` Dave Hansen
  2022-07-26 10:58   ` Borislav Petkov
  1 sibling, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-06-23 17:31 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 6/14/22 05:02, Kirill A. Shutemov wrote:
> Rework try_accept_one() to return accepted size instead of modifying
> 'start' inside the helper. It makes 'start' in-only argumaent and

							^ argument

> streamlines code on the caller side.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Suggested-by: Borislav Petkov <bp@alien8.de>

I'm not sure how much it actually streamlines things.  The line count
looks pretty similar.  But, I do generally dislike implicit return
values, so with that typo fixed:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 01/14] x86/boot: Centralize __pa()/__va() definitions
  2022-06-14 12:02 ` [PATCHv7 01/14] x86/boot: Centralize __pa()/__va() definitions Kirill A. Shutemov
@ 2022-06-23 17:37   ` Dave Hansen
  0 siblings, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-06-23 17:37 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 6/14/22 05:02, Kirill A. Shutemov wrote:
> Replace multiple __pa()/__va() definitions with a single one in misc.h.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/boot/compressed/ident_map_64.c | 8 --------
>  arch/x86/boot/compressed/misc.h         | 9 +++++++++
>  arch/x86/boot/compressed/sev.c          | 2 --
>  3 files changed, 9 insertions(+), 10 deletions(-)

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-23 17:23   ` Dave Hansen
@ 2022-06-23 21:48     ` Eric W. Biederman
  2022-06-24  2:00       ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Eric W. Biederman @ 2022-06-23 21:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, kexec

Dave Hansen <dave.hansen@intel.com> writes:

> ... adding kexec folks
>
> On 6/14/22 05:02, Kirill A. Shutemov wrote:
>> On kexec, the target kernel has to know what memory has been accepted.
>> Information in EFI map is out of date and cannot be used.
>> 
>> boot_params.unaccepted_memory can be used to pass the bitmap between two
>> kernels on kexec, but the use-case is not yet implemented.
>> 
>> Disable kexec on machines with unaccepted memory for now.
> ...
>> +static int __init unaccepted_init(void)
>> +{
>> +	if (!boot_params.unaccepted_memory)
>> +		return 0;
>> +
>> +#ifdef CONFIG_KEXEC_CORE
>> +	/*
>> +	 * TODO: Information on memory acceptance status has to be communicated
>> +	 * between kernel.
>> +	 */
>> +	pr_warn("Disable kexec: not yet supported on systems with unaccepted memory\n");
>> +	kexec_load_disabled = 1;
>> +#endif
>
> This looks to be the *only* in-kernel user tweaking kexec_load_disabled.
>  It doesn't feel great to just be disabling kexec like this.  Why not
> just fix it properly?
>
> What do the kexec folks think?

I didn't realize someone had implemented kexec_load_disabled.  I am not
particularly happy about that.  It looks like an over-broad stick that
we will have to support forever.

This change looks like it just builds on that bad decision.

If people don't want to deal with this situation right now, then I
recommend they make this new code and KEXEC conflict at the Kconfig
level.  That would give serious incentive to adding the missing
implementation.

If there is some deep and fundamental reason why this cannot be supported
then it probably makes sense to put some code in the arch_kexec_load
hook that verifies that deep and fundamental reason is present.

With the kexec code, all we need to verify that it works is a little
testing and careful code review.  Something like this makes code review
much harder because the entire kernel has to be checked to see if some
random driver changed a variable without locking, rather than having it
be apparent that this special case exists when reading through the
kexec code.

Eric


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-23 21:48     ` Eric W. Biederman
@ 2022-06-24  2:00       ` Kirill A. Shutemov
  2022-06-28 23:51         ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-24  2:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Hansen, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, kexec

On Thu, Jun 23, 2022 at 04:48:59PM -0500, Eric W. Biederman wrote:
> Dave Hansen <dave.hansen@intel.com> writes:
> 
> > ... adding kexec folks
> >
> > On 6/14/22 05:02, Kirill A. Shutemov wrote:
> >> On kexec, the target kernel has to know what memory has been accepted.
> >> Information in EFI map is out of date and cannot be used.
> >> 
> >> boot_params.unaccepted_memory can be used to pass the bitmap between two
> >> kernels on kexec, but the use-case is not yet implemented.
> >> 
> >> Disable kexec on machines with unaccepted memory for now.
> > ...
> >> +static int __init unaccepted_init(void)
> >> +{
> >> +	if (!boot_params.unaccepted_memory)
> >> +		return 0;
> >> +
> >> +#ifdef CONFIG_KEXEC_CORE
> >> +	/*
> >> +	 * TODO: Information on memory acceptance status has to be communicated
> >> +	 * between kernel.
> >> +	 */
> >> +	pr_warn("Disable kexec: not yet supported on systems with unaccepted memory\n");
> >> +	kexec_load_disabled = 1;
> >> +#endif
> >
> > This looks to be the *only* in-kernel user tweaking kexec_load_disabled.
> >  It doesn't feel great to just be disabling kexec like this.  Why not
> > just fix it properly?

Unfortunately, problems with kexec are not limited to unaccepted
memory. Isaku pointed out that MADT CPU wake is also problematic for
kexec. It doesn't allow CPU offlining, so the secondary kernel will not
be able to wake the CPUs up. So an additional limitation (as of now)
for kexec is !SMP on TDX guests.

I guess we can implement CPU offlining by going into a loop that checks
the mailbox and responds to commands. That loop has to be somehow
protected from being overwritten on kexec.

Other issues may come up as we actually try to implement it.

That's all doable, but it feels like scope creep for the unaccepted
memory enabling patchset :/

Is it a must for merge consideration?

> > What do the kexec folks think?
> 
> I didn't realize someone had implemented kexec_load_disabled.  I am not
> particularly happy about that.  It looks like an over-broad stick that
> we will have to support forever.
> 
> This change looks like it just builds on that bad decision.
> 
> If people don't want to deal with this situation right now, then I
> recommend they make this new code and KEXEC conflict at the Kconfig
> level.  That would give serious incentive to adding the missing
> implementation.

I tried to limit KEXEC at the Kconfig level before [1]. The naive approach does not work [2]:

WARNING: unmet direct dependencies detected for UNACCEPTED_MEMORY
  Depends on [n]: EFI [=y] && EFI_STUB [=y] && !KEXEC_CORE [=y]
  Selected by [y]:
  - INTEL_TDX_GUEST [=y] && HYPERVISOR_GUEST [=y] && X86_64 [=y] && CPU_SUP_INTEL [=y] && X86_X2APIC [=y]

Maybe my Kconfig-fu is not strong enough, I dunno.

[1] https://lore.kernel.org/all/20220425033934.68551-6-kirill.shutemov@linux.intel.com
[2] https://lore.kernel.org/all/YnOjJB8h3ZUR9sLX@zn.tnic

> If there is some deep and fundamental reason why this cannot be supported
> then it probably makes sense to put some code in the arch_kexec_load
> hook that verifies that deep and fundamental reason is present.

Sounds straightforward. I can do this.
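
Something like this, perhaps (untested; the existing body of
machine_kexec_prepare() is elided):

	int machine_kexec_prepare(struct kimage *image)
	{
		/*
		 * Memory acceptance status is not yet communicated to
		 * the target kernel: refuse the load here instead of
		 * flipping the global kexec_load_disabled switch.
		 */
		if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
		    boot_params.unaccepted_memory)
			return -EOPNOTSUPP;

		/* ... existing page table setup ... */
		return 0;
	}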

> With the kexec code all we have to verify it works is a little testing
> and careful code review.  Something like this makes code review much
> harder because the entire kernel has to be checked to see if some random
> driver without locking changed a variable.  Rather than having it
> apparent that this special case exists when reading through the kexec
> code.
> 
> Eric
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 14/14] x86/tdx: Add unaccepted memory support
  2022-06-14 12:02 ` [PATCHv7 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
@ 2022-06-24 16:22   ` Dave Hansen
  2022-06-27 10:42     ` Kirill A. Shutemov
  2022-07-26 14:51   ` Borislav Petkov
  1 sibling, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-06-24 16:22 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 6/14/22 05:02, Kirill A. Shutemov wrote:
>  static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
>  {
>  	/* Platform-specific memory-acceptance call goes here */
> -	error("Cannot accept memory");
> +	if (is_tdx_guest())
> +		tdx_accept_memory(start, end);
> +	else
> +		error("Cannot accept memory: unknown platform\n");
>  }

There are quite a few of these

	if (tdx())
		...

conditions in common code here.  Shouldn't this be something like a
CC_ATTR_MEM_ACCEPT?

	if (cc_platform_has(CC_ATTR_MEM_ACCEPT))
		cc_accept_memory(...);
	else
		error("Cannot accept memory: unknown platform\n");

I understand that TDX is the first one to the party.  Is this the time
to add the cc_ infrastructure?
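
For example (CC_ATTR_MEM_ACCEPT is a made-up name; the rest follows
the existing pattern in arch/x86/coco/core.c):

	static bool intel_cc_platform_has(enum cc_attr attr)
	{
		switch (attr) {
		case CC_ATTR_GUEST_UNROLL_STRING_IO:
		case CC_ATTR_GUEST_MEM_ENCRYPT:
		case CC_ATTR_MEM_ENCRYPT:
		case CC_ATTR_MEM_ACCEPT:	/* new */
			return true;
		default:
			return false;
		}
	}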

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2022-06-14 12:02 ` [PATCHv7 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
@ 2022-06-24 16:37 ` Peter Gonda
  2022-06-24 16:57   ` Dave Hansen
                     ` (2 more replies)
  2022-07-29 14:01 ` [PATCH v1 0/2] Provide SEV-SNP " Tom Lendacky
                   ` (4 subsequent siblings)
  19 siblings, 3 replies; 200+ messages in thread
From: Peter Gonda @ 2022-06-24 16:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, the arch/x86 maintainers, linux-mm,
	linux-coco, linux-efi, LKML

On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific for the Virtual
> Machine platform.
>
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
>
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 has to be modified on every page acceptance. It leads
> to table fragmentation, and there's a limited number of entries in the
> e820 table.
>
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents
> 2MiB in the address space: one 4k page is enough to track 64GiB of
> physical address space.
>
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- it needs 256MiB to handle 4PiB of the address
> space.
>
> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
>
> The approach lowers boot time substantially. Boot to shell is ~2.5x
> faster for 4G TDX VM and ~4x faster for 64G.
>
> TDX-specific code is isolated from the core of unaccepted memory
> support. It is supposed to help plug in different implementations of
> unaccepted memory, such as SEV-SNP.
>
> The tree can be found here:
>
> https://github.com/intel/tdx.git guest-unaccepted-memory

Hi Kirill,

I have a couple of questions about this feature, mainly about how cloud
customers can use it. Since this is a confidential compute feature, I
assume a large number of the users of these patches will be cloud
customers using TDX and SNP. One issue I see with these patches is how
we as a cloud provider know whether a customer's Linux image supports
this feature: if the image doesn't have these patches, UEFI needs to
fully validate the memory; if the image does, we can use this new
protocol. In GCE we supply our VMs with a version of the EDK2 FW and
the customer has no input into which UEFI we run; as far as I can tell
from the Azure SNP VM documentation, it seems very similar. We need to
somehow tell the UEFI in the VM what to do based on the image. The
current way I can see to solve this would be to have our customers give
us metadata about their VM's image, but this seems kind of burdensome
on our customers (I assume we'll have more features that both UEFI and
the kernel need to support in order to be turned on, like this one) and
error-prone: if a customer incorrectly labels their image, it may fail
to boot. Has there been any discussion about how to solve this? My
naive thought was: what if UEFI and the kernel had some sort of feature
negotiation? Maybe that could happen via an extension to exit boot
services or a UEFI runtime driver. I'm not sure what's best here, just
some ideas.



>
> v7:
>  - Rework meminfo counter to use PageUnaccepted() and move to generic code;
>  - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
>  - Add Reviewed-by from David;
> v6:
>  - Fix load_unaligned_zeropad() on machine with unaccepted memory;
>  - Clear PageUnaccepted() on merged pages, leaving it only on head;
>  - Clarify error handling in allocate_e820();
>  - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
>  - Disable kexec at boottime instead of build conflict;
>  - Rebased to tip/master;
>  - Spelling fixes;
>  - Add Reviewed-by from Mike and David;
> v5:
>  - Updates comments and commit messages;
>    + Explain options for unaccepted memory handling;
>  - Expose amount of unaccepted memory in /proc/meminfo
>  - Adjust check in page_expected_state();
>  - Fix error code handling in allocate_e820();
>  - Centralize __pa()/__va() definitions in the boot stub;
>  - Avoid includes from the main kernel in the boot stub;
>  - Use an existing hole in boot_param for unaccepted_memory, instead of adding
>    to the end of the structure;
>  - Extract allocate_unaccepted_memory() form allocate_e820();
>  - Complain if there's unaccepted memory, but kernel does not support it;
>  - Fix vmstat counter;
>  - Split up few preparatory patches;
>  - Random readability adjustments;
> v4:
>  - PageBuddyUnaccepted() -> PageUnaccepted();
>  - Use separate page_type, not shared with offline;
>  - Rework interface between core-mm and arch code;
>  - Adjust commit messages;
>  - Ack from Mike;
>
> Kirill A. Shutemov (14):
>   x86/boot: Centralize __pa()/__va() definitions
>   mm: Add support for unaccepted memory
>   mm: Report unaccepted memory in meminfo
>   efi/x86: Get full memory map in allocate_e820()
>   x86/boot: Add infrastructure required for unaccepted memory support
>   efi/x86: Implement support for unaccepted memory
>   x86/boot/compressed: Handle unaccepted memory
>   x86/mm: Reserve unaccepted memory bitmap
>   x86/mm: Provide helpers for unaccepted memory
>   x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
>   x86: Disable kexec if system has unaccepted memory
>   x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
>     boot stub
>   x86/tdx: Refactor try_accept_one()
>   x86/tdx: Add unaccepted memory support
>
>  Documentation/x86/zero-page.rst          |   1 +
>  arch/x86/Kconfig                         |   1 +
>  arch/x86/boot/bitops.h                   |  40 ++++++++
>  arch/x86/boot/compressed/Makefile        |   1 +
>  arch/x86/boot/compressed/align.h         |  14 +++
>  arch/x86/boot/compressed/bitmap.c        |  43 ++++++++
>  arch/x86/boot/compressed/bitmap.h        |  49 +++++++++
>  arch/x86/boot/compressed/bits.h          |  36 +++++++
>  arch/x86/boot/compressed/compiler.h      |   9 ++
>  arch/x86/boot/compressed/efi.h           |   1 +
>  arch/x86/boot/compressed/find.c          |  54 ++++++++++
>  arch/x86/boot/compressed/find.h          |  80 +++++++++++++++
>  arch/x86/boot/compressed/ident_map_64.c  |   8 --
>  arch/x86/boot/compressed/kaslr.c         |  35 ++++---
>  arch/x86/boot/compressed/math.h          |  37 +++++++
>  arch/x86/boot/compressed/mem.c           | 111 ++++++++++++++++++++
>  arch/x86/boot/compressed/minmax.h        |  61 +++++++++++
>  arch/x86/boot/compressed/misc.c          |   6 ++
>  arch/x86/boot/compressed/misc.h          |  15 +++
>  arch/x86/boot/compressed/pgtable_types.h |  25 +++++
>  arch/x86/boot/compressed/sev.c           |   2 -
>  arch/x86/boot/compressed/tdx.c           |  78 ++++++++++++++
>  arch/x86/coco/tdx/tdx.c                  |  94 ++++++++---------
>  arch/x86/include/asm/page.h              |   3 +
>  arch/x86/include/asm/shared/tdx.h        |  47 +++++++++
>  arch/x86/include/asm/tdx.h               |  19 ----
>  arch/x86/include/asm/unaccepted_memory.h |  16 +++
>  arch/x86/include/uapi/asm/bootparam.h    |   2 +-
>  arch/x86/kernel/e820.c                   |  10 ++
>  arch/x86/mm/Makefile                     |   2 +
>  arch/x86/mm/unaccepted_memory.c          | 123 +++++++++++++++++++++++
>  drivers/base/node.c                      |   7 ++
>  drivers/firmware/efi/Kconfig             |  14 +++
>  drivers/firmware/efi/efi.c               |   1 +
>  drivers/firmware/efi/libstub/x86-stub.c  | 103 ++++++++++++++++---
>  fs/proc/meminfo.c                        |   5 +
>  include/linux/efi.h                      |   3 +-
>  include/linux/mmzone.h                   |   1 +
>  include/linux/page-flags.h               |  31 ++++++
>  mm/internal.h                            |  12 +++
>  mm/memblock.c                            |   9 ++
>  mm/page_alloc.c                          |  96 +++++++++++++++++-
>  mm/vmstat.c                              |   1 +
>  43 files changed, 1191 insertions(+), 115 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/align.h
>  create mode 100644 arch/x86/boot/compressed/bitmap.c
>  create mode 100644 arch/x86/boot/compressed/bitmap.h
>  create mode 100644 arch/x86/boot/compressed/bits.h
>  create mode 100644 arch/x86/boot/compressed/compiler.h
>  create mode 100644 arch/x86/boot/compressed/find.c
>  create mode 100644 arch/x86/boot/compressed/find.h
>  create mode 100644 arch/x86/boot/compressed/math.h
>  create mode 100644 arch/x86/boot/compressed/mem.c
>  create mode 100644 arch/x86/boot/compressed/minmax.h
>  create mode 100644 arch/x86/boot/compressed/pgtable_types.h
>  create mode 100644 arch/x86/include/asm/unaccepted_memory.h
>  create mode 100644 arch/x86/mm/unaccepted_memory.c
>
> --
> 2.35.1
>
>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 16:37 ` [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Peter Gonda
@ 2022-06-24 16:57   ` Dave Hansen
  2022-06-24 17:06     ` Marc Orr
  2022-06-24 17:40   ` Michael Roth
  2022-06-27 11:30   ` Kirill A. Shutemov
  2 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-06-24 16:57 UTC (permalink / raw)
  To: Peter Gonda, Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, the arch/x86 maintainers, linux-mm, linux-coco,
	linux-efi, LKML

Peter, is your enter key broken?  You seem to be typing all your text in
a single unreadable paragraph.

On 6/24/22 09:37, Peter Gonda wrote:
> if a customer incorrectly labels their image it may fail to boot.

You're saying that firmware basically has two choices:
1. Accept all the memory up front and boot slowly, but reliably
2. Use this "unaccepted memory" mechanism, boot fast, but risk that the
   VM loses a bunch of memory.

If the guest can't even boot because of a lack of memory, then the
pre-accepted chunk is probably too small in the first place.

If the customer screws up, they lose a bunch of the RAM they paid for.
That seems like a rather self-correcting problem to me.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 16:57   ` Dave Hansen
@ 2022-06-24 17:06     ` Marc Orr
  2022-06-24 17:09       ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Marc Orr @ 2022-06-24 17:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Gonda, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers, linux-mm, linux-coco,
	linux-efi, LKML

On Fri, Jun 24, 2022 at 9:57 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Peter, is your enter key broken?  You seem to be typing all your text in
> a single unreadable paragraph.
>
> On 6/24/22 09:37, Peter Gonda wrote:
> > if a customer incorrectly labels their image it may fail to boot.
>
> You're saying that firmware basically has two choices:
> 1. Accept all the memory up front and boot slowly, but reliably
> 2. Use this "unaccepted memory" mechanism, boot fast, but risk that the
>    VM loses a bunch of memory.
>
> If the guest can't even boot because of a lack of memory, then the
> pre-accepted chunk is probably too small in the first place.
>
> If the customer screws up, they lose a bunch of the RAM they paid for.
> That seems like a rather self-correcting problem to me.

I think Peter's point is a little more nuanced than that. Once lazy
accept goes into the guest firmware -- without the feature negotiation
that Peter is suggesting -- cloud providers now have a bookkeeping
problem. Which images have kernels that can boot from a guest firmware
that doesn't pre-validate all the guest memory?

The way we've been solving similar bookkeeping problems up to now
(e.g., Which guest can run with CVM features like TDX/SEV enabled?
Which SEV guests can live migrate?) is as follows. We tag images with
feature tags. But this is sort of a hack. And not a great one. It's
confusing to customers, hard for the cloud service provider to
support, and easy to mess up.

It would be better if the guest FW knew whether or not the kernel it
was going to launch supported lazy accept.

That being said, this does seem like a difficult problem to solve,
since it's sort of backward from how things work, in that when the
guest firmware wants to switch between pre-validating all memory vs.
minimizing what it pre-validates, the guest kernel is not running yet!
But if there is some way to do this, it would be a huge improvement
over the current status quo of pushing the feature negotiation up to
the cloud service provider and ultimately the cloud customer.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:06     ` Marc Orr
@ 2022-06-24 17:09       ` Dave Hansen
  2022-06-24 17:15         ` Peter Gonda
  2022-06-24 17:19         ` Marc Orr
  0 siblings, 2 replies; 200+ messages in thread
From: Dave Hansen @ 2022-06-24 17:09 UTC (permalink / raw)
  To: Marc Orr
  Cc: Peter Gonda, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers, linux-mm, linux-coco,
	linux-efi, LKML

On 6/24/22 10:06, Marc Orr wrote:
> I think Peter's point is a little more nuanced than that. Once lazy
> accept goes into the guest firmware -- without the feature negotiation
> that Peter is suggesting -- cloud providers now have a bookkeeping
> problem. Which images have kernels that can boot from a guest firmware
> that doesn't pre-validate all the guest memory?

Hold on a sec though...

Is this a matter of

	can boot from a guest firmware that doesn't pre-validate all the
	guest memory?

or

	can boot from a guest firmware that doesn't pre-validate all the
	guest memory ... with access to all of that guest's RAM?

In other words, are we talking about "fails to boot" or "can't see all
the RAM"?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:09       ` Dave Hansen
@ 2022-06-24 17:15         ` Peter Gonda
  2022-06-24 17:19         ` Marc Orr
  1 sibling, 0 replies; 200+ messages in thread
From: Peter Gonda @ 2022-06-24 17:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Marc Orr, Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers, linux-mm,
	linux-coco, linux-efi, LKML

>> > Peter, is your enter key broken?  You seem to be typing all your text in
>> > a single unreadable paragraph.

Sorry, I will try to format better in the future.

>> > You're saying that firmware basically has two choices:
>> > 1. Accept all the memory up front and boot slowly, but reliably
>> > 2. Use this "unaccepted memory" mechanism, boot fast, but risk that the
>> >    VM loses a bunch of memory.

That's right. Given that the first round of SNP guest patches are in,
but this work to support unaccepted memory for SNP is not, we assume we
will have distros that support SNP without this "unaccepted memory"
feature.

On Fri, Jun 24, 2022 at 11:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:06, Marc Orr wrote:
> > I think Peter's point is a little more nuanced than that. Once lazy
> > accept goes into the guest firmware -- without the feature negotiation
> > that Peter is suggesting -- cloud providers now have a bookkeeping
> > problem. Which images have kernels that can boot from a guest firmware
> > that doesn't pre-validate all the guest memory?
>
> Hold on a sec though...
>
> Is this a matter of
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory?
>
> or
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory ... with access to all of that guest's RAM?
>
> In other words, are we talking about "fails to boot" or "can't see all
> the RAM"?
>

Yes, I'm sorry, I was mistaken. If FW uses unaccepted memory but the
kernel doesn't support it, the VM should still boot but will fail to
utilize all of its given RAM.

>> > If the customer screws up, they lose a bunch of the RAM they paid for.
>> > That seems like a rather self-correcting problem to me.

Providing customers with an easy-to-use product is a problem for us,
the cloud provider; encoding foot-guns doesn't sound like what's best
for the user here. I wanted to bring this up here since it seems like
a problem most vendors/users of SNP and TDX would run into. We can of
course figure this out internally if no one else sees this as an
issue.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:09       ` Dave Hansen
  2022-06-24 17:15         ` Peter Gonda
@ 2022-06-24 17:19         ` Marc Orr
  2022-06-24 17:21           ` Peter Gonda
  2022-06-24 17:47           ` Dave Hansen
  1 sibling, 2 replies; 200+ messages in thread
From: Marc Orr @ 2022-06-24 17:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Gonda, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers, linux-mm, linux-coco,
	linux-efi, LKML

On Fri, Jun 24, 2022 at 10:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:06, Marc Orr wrote:
> > I think Peter's point is a little more nuanced than that. Once lazy
> > accept goes into the guest firmware -- without the feature negotiation
> > that Peter is suggesting -- cloud providers now have a bookkeeping
> > problem. Which images have kernels that can boot from a guest firmware
> > that doesn't pre-validate all the guest memory?
>
> Hold on a sec though...
>
> Is this a matter of
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory?
>
> or
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory ... with access to all of that guest's RAM?
>
> In other words, are we talking about "fails to boot" or "can't see all
> the RAM"?

Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
won't have access to all of the memory that the customer is paying
for. But that's still bad. If the customer buys a 96 GB VM and can
only see 4GB because their kernel doesn't have these patches, they're
going to be confused and frustrated.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:19         ` Marc Orr
@ 2022-06-24 17:21           ` Peter Gonda
  2022-06-24 17:47           ` Dave Hansen
  1 sibling, 0 replies; 200+ messages in thread
From: Peter Gonda @ 2022-06-24 17:21 UTC (permalink / raw)
  To: Marc Orr
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers, linux-mm, linux-coco,
	linux-efi, LKML

On Fri, Jun 24, 2022 at 11:19 AM Marc Orr <marcorr@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 6/24/22 10:06, Marc Orr wrote:
> > > I think Peter's point is a little more nuanced than that. Once lazy
> > > accept goes into the guest firmware -- without the feature negotiation
> > > that Peter is suggesting -- cloud providers now have a bookkeeping
> > > problem. Which images have kernels that can boot from a guest firmware
> > > that doesn't pre-validate all the guest memory?
> >
> > Hold on a sec though...
> >
> > Is this a matter of
> >
> >         can boot from a guest firmware that doesn't pre-validate all the
> >         guest memory?
> >
> > or
> >
> >         can boot from a guest firmware that doesn't pre-validate all the
> >         guest memory ... with access to all of that guest's RAM?
> >
> > In other words, are we talking about "fails to boot" or "can't see all
> > the RAM"?
>
> Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> won't have access to all of the memory that the customer is paying
> for. But that's still bad. If the customer buys a 96 GB VM and can
> only see 4GB because their kernel doesn't have these patches, they're
> going to be confused and frustrated.

The other error case, which might be more confusing to the customer,
is that their kernel does have these patches, but there is some
misconfiguration and their VM boots slowly because the FW uses the
accept-all-memory approach.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 16:37 ` [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Peter Gonda
  2022-06-24 16:57   ` Dave Hansen
@ 2022-06-24 17:40   ` Michael Roth
  2022-06-24 17:58     ` Michael Roth
  2022-06-24 18:05     ` Peter Gonda
  2022-06-27 11:30   ` Kirill A. Shutemov
  2 siblings, 2 replies; 200+ messages in thread
From: Michael Roth @ 2022-06-24 17:40 UTC (permalink / raw)
  To: Peter Gonda
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi, LKML

On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific for the Virtual
> > Machine platform.
> >
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > The kernel needs to know what memory has been accepted. Firmware
> > communicates this information via memory map: a new memory type --
> > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> >
> > Range-based tracking works fine for firmware, but it gets bulky for
> > the kernel: e820 has to be modified on every page acceptance. It leads
> > to table fragmentation, and there's only a limited number of entries
> > in the e820 table.
> >
> > Another option is to mark such memory as usable in e820 and track if the
> > range has been accepted in a bitmap. One bit in the bitmap represents
> > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > physical address space.
> >
> > In the worst-case scenario -- a huge hole in the middle of the
> > address space -- it needs 256MiB to handle 4PiB of the address
> > space.
> >
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> >
> > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > faster for 4G TDX VM and ~4x faster for 64G.
> >
> > TDX-specific code is isolated from the core of unaccepted memory
> > support. This is supposed to help plug in different implementations of
> > unaccepted memory, such as SEV-SNP.
> >
> > The tree can be found here:
> >
> > https://github.com/intel/tdx.git guest-unaccepted-memory
> 
> Hi Kirill,
> 
> I have a couple of questions about this feature, mainly about how
> cloud customers can use it. I assume that since this is a
> confidential compute feature, a large number of the users of these
> patches will be cloud customers using TDX and SNP. One issue I see
> with these patches is how we as a cloud provider know whether a
> customer's Linux image supports this feature: if the image doesn't
> have these patches, UEFI needs to fully validate the memory; if the
> image does, we can use this new protocol. In GCE we supply our VMs
> with a version of the EDK2 FW and the customer doesn't have input
> into which UEFI we run; as far as I can tell from the Azure SNP VM
> documentation it seems very similar. We need to somehow tell our
> UEFI in the VM what to do based on the image. The current way I can
> see to solve this issue would be to have our customers give us
> metadata about their VM's image, but this seems kinda burdensome on
> our customers (I assume we'll have more features which both UEFI and
> kernel need to support in order to be turned on, like this one) and
> error-prone: if a customer incorrectly labels their

> image it may fail to boot. Has there been any discussion about how
> to solve this? My naive thoughts were: what if UEFI and kernel had
> some sort of feature negotiation? Maybe that could happen via an
> extension to exit boot services or a UEFI runtime driver; I'm not
> sure what's best here, just some ideas.

Not sure if you've seen this thread or not, but there's also been some
discussion around this in the context of the UEFI support:

  https://patchew.org/EDK2/cover.1654420875.git.min.m.xu@intel.com/cce5ea2aaaeddd9ce9df6fa7ac1ef52976c5c7e6.1654420876.git.min.m.xu@intel.com/#20220608061805.vvsjiqt55rqnl3fw@sirius.home.kraxel.org

Two things are being discussed there, really, which I think roughly
boil down to:

 1) how to configure OVMF to enable/disable lazy acceptance
    - compile time option most likely: accept-all/accept-minimum/accept-1GB

 2) how to introduce an automatic mode in the future where OVMF does the
    right thing based on what the guest supports. Gerd floated the idea of
    tying it to ExitBootServices as well, but not sure there's a solid
    plan on what to do here yet.

If that's accurate, it seems like the only 'safe' option is to disable it via
#1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
Right Thing.

Users who know their VMs implement lazy acceptance can force it on via
accept-all OVMF compile option.

-Mike

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:19         ` Marc Orr
  2022-06-24 17:21           ` Peter Gonda
@ 2022-06-24 17:47           ` Dave Hansen
  2022-06-24 18:10             ` Peter Gonda
  1 sibling, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-06-24 17:47 UTC (permalink / raw)
  To: Marc Orr
  Cc: Peter Gonda, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers, linux-mm, linux-coco,
	linux-efi, LKML

On 6/24/22 10:19, Marc Orr wrote:
>> Is this a matter of
>>
>>         can boot from a guest firmware that doesn't pre-validate all the
>>         guest memory?
>>
>> or
>>
>>         can boot from a guest firmware that doesn't pre-validate all the
>>         guest memory ... with access to all of that guest's RAM?
>>
>> In other words, are we talking about "fails to boot" or "can't see all
>> the RAM"?
> Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> won't have access to all of the memory that the customer is paying
> for. But that's still bad. If the customer buys a 96 GB VM and can
> only see 4GB because their kernel doesn't have these patches, they're
> going to be confused and frustrated.

They'll at least be a _bit_ less angry and frustrated than if they were
staring at a blank screen. ;)  But, yeah, I totally get the point.

How big is the window going to be where we have guests that can have
unaccepted memory, but don't have acceptance support?  For TDX, it's
looking like it'll probably _just_ be 5.19.  Is TDX on 5.19 in shape
that cloud providers can deploy it?  Or, is stuff like lack of
attestation a deal breaker?


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:40   ` Michael Roth
@ 2022-06-24 17:58     ` Michael Roth
  2022-06-24 18:05     ` Peter Gonda
  1 sibling, 0 replies; 200+ messages in thread
From: Michael Roth @ 2022-06-24 17:58 UTC (permalink / raw)
  To: Peter Gonda
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi, LKML

On Fri, Jun 24, 2022 at 12:40:57PM -0500, Michael Roth wrote:
> 
>  1) how to configure OVMF to enable/disable lazy acceptance
>     - compile time option most likely: accept-all/accept-minimum/accept-1GB
> 
>  2) how to introduce an automatic mode in the future where OVMF does the
>     right thing based on what the guest supports. Gerd floated the idea of
>     tying it to ExitBootServices as well, but not sure there's a solid
>     plan on what to do here yet.
> 
> If that's accurate, it seems like the only 'safe' option is to disable it via
> #1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
> Right Thing.
> 
> Users who know their VMs implement lazy acceptance can force it on via
> accept-all OVMF compile option.

accept-min / accept-X I mean.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:40   ` Michael Roth
  2022-06-24 17:58     ` Michael Roth
@ 2022-06-24 18:05     ` Peter Gonda
  1 sibling, 0 replies; 200+ messages in thread
From: Peter Gonda @ 2022-06-24 18:05 UTC (permalink / raw)
  To: Michael Roth
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, Marcelo,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi, LKML

On Fri, Jun 24, 2022 at 11:41 AM Michael Roth <michael.roth@amd.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > UEFI Specification version 2.9 introduces the concept of memory
> > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > guest. Accepting happens via a protocol specific for the Virtual
> > > Machine platform.
> > >
> > > Accepting memory is costly and it makes VMM allocate memory for the
> > > accepted guest physical address range. It's better to postpone memory
> > > acceptance until memory is needed. It lowers boot time and reduces
> > > memory overhead.
> > >
> > > The kernel needs to know what memory has been accepted. Firmware
> > > communicates this information via memory map: a new memory type --
> > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > >
> > > Range-based tracking works fine for firmware, but it gets bulky for
> > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > to table fragmentation, and there's only a limited number of entries
> > > in the e820 table.
> > >
> > > Another option is to mark such memory as usable in e820 and track if the
> > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > > physical address space.
> > >
> > > In the worst-case scenario -- a huge hole in the middle of the
> > > address space -- it needs 256MiB to handle 4PiB of the address
> > > space.
> > >
> > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > >
> > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > faster for 4G TDX VM and ~4x faster for 64G.
> > >
> > > TDX-specific code is isolated from the core of unaccepted memory
> > > support. This is supposed to help plug in different implementations of
> > > unaccepted memory, such as SEV-SNP.
> > >
> > > The tree can be found here:
> > >
> > > https://github.com/intel/tdx.git guest-unaccepted-memory
> >
> > Hi Kirill,
> >
> > I have a couple of questions about this feature, mainly about how
> > cloud customers can use it. I assume that since this is a
> > confidential compute feature, a large number of the users of these
> > patches will be cloud customers using TDX and SNP. One issue I see
> > with these patches is how we as a cloud provider know whether a
> > customer's Linux image supports this feature: if the image doesn't
> > have these patches, UEFI needs to fully validate the memory; if the
> > image does, we can use this new protocol. In GCE we supply our VMs
> > with a version of the EDK2 FW and the customer doesn't have input
> > into which UEFI we run; as far as I can tell from the Azure SNP VM
> > documentation it seems very similar. We need to somehow tell our
> > UEFI in the VM what to do based on the image. The current way I can
> > see to solve this issue would be to have our customers give us
> > metadata about their VM's image, but this seems kinda burdensome on
> > our customers (I assume we'll have more features which both UEFI and
> > kernel need to support in order to be turned on, like this one) and
> > error-prone: if a customer incorrectly labels their
>
> > image it may fail to boot. Has there been any discussion about how
> > to solve this? My naive thoughts were: what if UEFI and kernel had
> > some sort of feature negotiation? Maybe that could happen via an
> > extension to exit boot services or a UEFI runtime driver; I'm not
> > sure what's best here, just some ideas.
>
> Not sure if you've seen this thread or not, but there's also been some
> discussion around this in the context of the UEFI support:
>
>   https://patchew.org/EDK2/cover.1654420875.git.min.m.xu@intel.com/cce5ea2aaaeddd9ce9df6fa7ac1ef52976c5c7e6.1654420876.git.min.m.xu@intel.com/#20220608061805.vvsjiqt55rqnl3fw@sirius.home.kraxel.org
>
> Two things are being discussed there, really, which I think roughly
> boil down to:
>
>  1) how to configure OVMF to enable/disable lazy acceptance
>     - compile time option most likely: accept-all/accept-minimum/accept-1GB
>
>  2) how to introduce an automatic mode in the future where OVMF does the
>     right thing based on what the guest supports. Gerd floated the idea of
>     tying it to ExitBootServices as well, but not sure there's a solid
>     plan on what to do here yet.
>
> If that's accurate, it seems like the only 'safe' option is to disable it via
> #1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
> Right Thing.
>
> Users who know their VMs implement lazy acceptance can force it on via
> accept-all OVMF compile option.

Thanks for this Mike! I will bring this to the EDK2 community.

The issue for us is that our users use a GCE-built EDK2, not their own
compiled version, so they don't have the choice. Reading the Azure docs
it seems the same for them, and for AWS, so I don't know how often
customers actually get to bring their own firmware.

>
> -Mike

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 17:47           ` Dave Hansen
@ 2022-06-24 18:10             ` Peter Gonda
  2022-06-24 18:13               ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Peter Gonda @ 2022-06-24 18:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Marc Orr, Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers, linux-mm,
	linux-coco, linux-efi, LKML

On Fri, Jun 24, 2022 at 11:47 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:19, Marc Orr wrote:
> >> Is this a matter of
> >>
> >>         can boot from a guest firmware that doesn't pre-validate all the
> >>         guest memory?
> >>
> >> or
> >>
> >>         can boot from a guest firmware that doesn't pre-validate all the
> >>         guest memory ... with access to all of that guest's RAM?
> >>
> >> In other words, are we talking about "fails to boot" or "can't see all
> >> the RAM"?
> > Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> > won't have access to all of the memory that the customer is paying
> > for. But that's still bad. If the customer buys a 96 GB VM and can
> > only see 4GB because their kernel doesn't have these patches, they're
> > going to be confused and frustrated.
>
> They'll at least be a _bit_ less angry and frustrated than if they were
> staring at a blank screen. ;)  But, yeah, I totally get the point.

Ha! Well, we do have that issue in some cases. If you try to run an SEV
VM with an image that doesn't support SEV, you will just get a blank
serial screen. If we had something like this back then, the FW could
have surfaced a nice error to the user, but that's history now.

>
> How big is the window going to be where we have guests that can have
> unaccepted memory, but don't have acceptance support?  For TDX, it's
> looking like it'll probably _just_ be 5.19.  Is TDX on 5.19 in shape
> that cloud providers can deploy it?  Or, is stuff like lack of
> attestation a deal breaker?

This is complicated because distros don't run upstream Linux versions.
If I understand correctly (I see some distro emails on here, so please
correct me), distros normally maintain forks into which they backport
things. So I cannot answer this question. It is possible that a
hypothetical distro backports only the initial SNP/TDX patches and
doesn't take these for many releases.

I am more familiar with SNP and it does have some attestation support
in the first patch sets.

Also, I should have been clearer. I don't want to try to hold up this
feature, but instead to discuss a future usability add-on feature.

>
>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 18:10             ` Peter Gonda
@ 2022-06-24 18:13               ` Dave Hansen
  0 siblings, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-06-24 18:13 UTC (permalink / raw)
  To: Peter Gonda
  Cc: Marc Orr, Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers, linux-mm,
	linux-coco, linux-efi, LKML

On 6/24/22 11:10, Peter Gonda wrote:
>> How big is the window going to be where we have guests that can have
>> unaccepted memory, but don't have acceptance support?  For TDX, it's
>> looking like it'll probably _just_ be 5.19.  Is TDX on 5.19 in shape
>> that cloud providers can deploy it?  Or, is stuff like lack of
>> attestation a deal breaker?
> This is complicated because distros don't run upstream Linux versions.
> If I understand correctly (I see some distro emails on here, so please
> correct me), distros normally maintain forks into which they backport
> things. So I cannot answer this question. It is possible that a
> hypothetical distro backports only the initial SNP/TDX patches and
> doesn't take these for many releases.

Distros could also backport a bare-bones version of this set that
doesn't do anything fancy and just synchronously accepts the memory at
boot.  No bitmap, no page allocator changes.  It'll slow boot down, but
is better than having no RAM.
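
To illustrate, such a bare-bones backport could boil down to one
early-boot walk of the EFI memory map. A rough sketch, reusing the
series' accept_memory() helper and the new EFI_UNACCEPTED_MEMORY type
(the function name here is made up):

/*
 * Sketch only: accept every EFI_UNACCEPTED_MEMORY range synchronously
 * at boot, so no bitmap and no page allocator changes are needed.
 */
static void __init accept_all_memory(void)
{
	efi_memory_desc_t *md;

	for_each_efi_memory_desc(md) {
		phys_addr_t start = md->phys_addr;
		phys_addr_t end = start + (md->num_pages << EFI_PAGE_SHIFT);

		if (md->type != EFI_UNACCEPTED_MEMORY)
			continue;

		/* Platform-specific (TDX/SNP) acceptance happens here. */
		accept_memory(start, end);
	}
}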

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 14/14] x86/tdx: Add unaccepted memory support
  2022-06-24 16:22   ` Dave Hansen
@ 2022-06-27 10:42     ` Kirill A. Shutemov
  0 siblings, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-27 10:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel

On Fri, Jun 24, 2022 at 09:22:03AM -0700, Dave Hansen wrote:
> On 6/14/22 05:02, Kirill A. Shutemov wrote:
> >  static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
> >  {
> >  	/* Platform-specific memory-acceptance call goes here */
> > -	error("Cannot accept memory");
> > +	if (is_tdx_guest())
> > +		tdx_accept_memory(start, end);
> > +	else
> > +		error("Cannot accept memory: unknown platform\n");
> >  }
> 
> There are quite a few of these
> 
> 	if (tdx())
> 		...
> 
> conditions in common code here.  Shouldn't this be something like a
> CC_ATTR_MEM_ACCEPT?
> 
> 	if (cc_platform_has(CC_ATTR_MEM_ACCEPT))
> 		cc_accept_memory(...);
> 	else
> 		error("Cannot accept memory: unknown platform\n");
> 
> I understand that TDX is the first one to the party.  Is this the time
> to add the cc_ infrastructure?

We need an 'if tdx()' check *somewhere*, as how exactly memory gets
accepted is specific to a particular platform.

There are two callsites where memory acceptance happens. One of them is
in the boot stub, where we don't have the cc_ infrastructure. So it
would boil down to a single cc_accept_memory() that has an 'if tdx()'
inside.

I don't see much sense in the exercise. We may as well keep the 'if' in
accept_memory().
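
Concretely, the boiled-down helper would look something like this
(sketch only; tdx_accept_memory() is from this series, while
cc_accept_memory() is the hypothetical cc_ wrapper being discussed):

/* Hypothetical cc_ wrapper: the platform check just moves in here. */
static inline void cc_accept_memory(phys_addr_t start, phys_addr_t end)
{
	if (is_tdx_guest())
		tdx_accept_memory(start, end);
	else
		error("Cannot accept memory: unknown platform\n");
}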

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-24 16:37 ` [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Peter Gonda
  2022-06-24 16:57   ` Dave Hansen
  2022-06-24 17:40   ` Michael Roth
@ 2022-06-27 11:30   ` Kirill A. Shutemov
  2022-06-27 11:54     ` Ard Biesheuvel
  2 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-27 11:30 UTC (permalink / raw)
  To: Peter Gonda
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, the arch/x86 maintainers, linux-mm,
	linux-coco, linux-efi, LKML

On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific for the Virtual
> > Machine platform.
> >
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > The kernel needs to know what memory has been accepted. Firmware
> > communicates this information via memory map: a new memory type --
> > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> >
> > Range-based tracking works fine for firmware, but it gets bulky for
> > the kernel: e820 has to be modified on every page acceptance. It leads
> > to table fragmentation, and there's only a limited number of entries
> > in the e820 table.
> >
> > Another option is to mark such memory as usable in e820 and track if the
> > range has been accepted in a bitmap. One bit in the bitmap represents
> > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > physical address space.
> >
> > In the worst-case scenario -- a huge hole in the middle of the
> > address space -- it needs 256MiB to handle 4PiB of the address
> > space.
> >
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> >
> > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > faster for 4G TDX VM and ~4x faster for 64G.
> >
> > TDX-specific code is isolated from the core of unaccepted memory
> > support. This is supposed to help plug in different implementations of
> > unaccepted memory, such as SEV-SNP.
> >
> > The tree can be found here:
> >
> > https://github.com/intel/tdx.git guest-unaccepted-memory
> 
> Hi Kirill,
> 
> I have a couple of questions about this feature, mainly about how
> cloud customers can use it. I assume that since this is a
> confidential compute feature, a large number of the users of these
> patches will be cloud customers using TDX and SNP. One issue I see
> with these patches is how we as a cloud provider know whether a
> customer's Linux image supports this feature: if the image doesn't
> have these patches, UEFI needs to fully validate the memory; if the
> image does, we can use this new protocol. In GCE we supply our VMs
> with a version of the EDK2 FW and the customer doesn't have input
> into which UEFI we run; as far as I can tell from the Azure SNP VM
> documentation it seems very similar. We need to somehow tell our
> UEFI in the VM what to do based on the image. The current way I can
> see to solve this issue would be to have our customers give us
> metadata about their VM's image, but this seems kinda burdensome on
> our customers (I assume we'll have more features which both UEFI and
> kernel need to support in order to be turned on, like this one) and
> error-prone: if a customer incorrectly labels their
> image it may fail to boot. Has there been any discussion about how
> to solve this? My naive thoughts were: what if UEFI and kernel had
> some sort of feature negotiation? Maybe that could happen via an
> extension to exit boot services or a UEFI runtime driver; I'm not
> sure what's best here, just some ideas.

Just as an idea, we can put info into UTS_VERSION, which can be read
from the built bzImage. We have info on SMP and preemption there
already.

Patch below does this:

$ file arch/x86/boot/bzImage
arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 5.19.0-rc3-00016-g2f6aa48e28d9-dirty (kas@box) #2300 SMP PREEMPT_DYNAMIC UNACCEPTED_MEMORY Mon Jun 27 14:23:04 , RO-rootFS, swap_dev 0XC, Normal VGA

Note UNACCEPTED_MEMORY in the output.
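
For reference, file(1) finds that string via the boot protocol: the
setup header field at offset 0x20e stores a pointer to the version
string, less 0x200 (see Documentation/x86/boot.rst). A standalone
host-side sketch, not part of this series, that checks a bzImage for
the flag:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(int argc, char **argv)
{
	FILE *f;
	uint16_t kver = 0;	/* assumes little-endian host, fine on x86 */
	char buf[256] = { 0 };

	if (argc != 2 || !(f = fopen(argv[1], "rb")))
		return 1;

	/* kernel_version field of the setup header (x86 boot protocol) */
	fseek(f, 0x20e, SEEK_SET);
	fread(&kver, sizeof(kver), 1, f);
	if (!kver)		/* zero means no version string present */
		return 2;

	/* the string itself lives at kernel_version + 0x200 */
	fseek(f, kver + 0x200, SEEK_SET);
	fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);

	printf("%s\n", buf);
	return strstr(buf, "UNACCEPTED_MEMORY") ? 0 : 2;
}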

Probably we want to have info there on which flavour of unaccepted
memory is supported (TDX/SNP/whatever). That is a bit more tricky.

Any opinion?

diff --git a/init/Makefile b/init/Makefile
index d82623d7fc8e..6688ea43e6bf 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -32,7 +32,7 @@ quiet_cmd_compile.h = CHK     $@
 	$(CONFIG_SHELL) $(srctree)/scripts/mkcompile_h $@	\
 	"$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT_BUILD)"	\
 	"$(CONFIG_PREEMPT_DYNAMIC)" "$(CONFIG_PREEMPT_RT)" \
-	"$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
+	"$(CONFIG_UNACCEPTED_MEMORY)" "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"

 include/generated/compile.h: FORCE
 	$(call cmd,compile.h)
diff --git a/scripts/mkcompile_h b/scripts/mkcompile_h
index ca40a5258c87..efacfecad699 100755
--- a/scripts/mkcompile_h
+++ b/scripts/mkcompile_h
@@ -7,8 +7,9 @@ SMP=$3
 PREEMPT=$4
 PREEMPT_DYNAMIC=$5
 PREEMPT_RT=$6
-CC_VERSION="$7"
-LD=$8
+UNACCEPTED_MEMORY=$7
+CC_VERSION="$8"
+LD=$9

 # Do not expand names
 set -f
@@ -51,6 +52,10 @@ elif [ -n "$PREEMPT" ] ; then
 	CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"
 fi

+if [ -n "$UNACCEPTED_MEMORY" ] ; then
+	CONFIG_FLAGS="$CONFIG_FLAGS UNACCEPTED_MEMORY"
+fi
+
 # Truncate to maximum length
 UTS_LEN=64
 UTS_VERSION="$(echo $UTS_VERSION $CONFIG_FLAGS $TIMESTAMP | cut -b -$UTS_LEN)"
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-27 11:30   ` Kirill A. Shutemov
@ 2022-06-27 11:54     ` Ard Biesheuvel
  2022-06-27 12:22       ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Ard Biesheuvel @ 2022-06-27 11:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	khalid.elmously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML

On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > UEFI Specification version 2.9 introduces the concept of memory
> > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > guest. Accepting happens via a protocol specific for the Virtual
> > > Machine platform.
> > >
> > > Accepting memory is costly and it makes VMM allocate memory for the
> > > accepted guest physical address range. It's better to postpone memory
> > > acceptance until memory is needed. It lowers boot time and reduces
> > > memory overhead.
> > >
> > > The kernel needs to know what memory has been accepted. Firmware
> > > communicates this information via memory map: a new memory type --
> > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > >
> > > Range-based tracking works fine for firmware, but it gets bulky for
> > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > to table fragmentation, and there's only a limited number of entries
> > > in the e820 table.
> > >
> > > Another option is to mark such memory as usable in e820 and track if the
> > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > > physical address space.
> > >
> > > In the worst-case scenario -- a huge hole in the middle of the
> > > address space -- it needs 256MiB to handle 4PiB of the address
> > > space.
> > >
> > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > >
> > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > faster for 4G TDX VM and ~4x faster for 64G.
> > >
> > > TDX-specific code is isolated from the core of unaccepted memory
> > > support. This is supposed to help plug in different implementations of
> > > unaccepted memory, such as SEV-SNP.
> > >
> > > The tree can be found here:
> > >
> > > https://github.com/intel/tdx.git guest-unaccepted-memory
> >
> > Hi Kirill,
> >
> > I have a couple of questions about this feature, mainly about how
> > cloud customers can use it. I assume that since this is a
> > confidential compute feature, a large number of the users of these
> > patches will be cloud customers using TDX and SNP. One issue I see
> > with these patches is how we as a cloud provider know whether a
> > customer's Linux image supports this feature: if the image doesn't
> > have these patches, UEFI needs to fully validate the memory; if the
> > image does, we can use this new protocol. In GCE we supply our VMs
> > with a version of the EDK2 FW and the customer doesn't have input
> > into which UEFI we run; as far as I can tell from the Azure SNP VM
> > documentation it seems very similar. We need to somehow tell our
> > UEFI in the VM what to do based on the image. The current way I can
> > see to solve this issue would be to have our customers give us
> > metadata about their VM's image, but this seems kinda burdensome on
> > our customers (I assume we'll have more features which both UEFI and
> > kernel need to support in order to be turned on, like this one) and
> > error-prone: if a customer incorrectly labels their
> > image it may fail to boot. Has there been any discussion about how
> > to solve this? My naive thoughts were: what if UEFI and kernel had
> > some sort of feature negotiation? Maybe that could happen via an
> > extension to exit boot services or a UEFI runtime driver; I'm not
> > sure what's best here, just some ideas.
>
> Just as an idea, we can put info into UTS_VERSION, which can be read
> from the built bzImage. We have info on SMP and preemption there
> already.
>

Instead of hacking this into the binary, couldn't we define a protocol
that the kernel will call from the EFI stub (before EBS()) to identify
itself as an image that understands unaccepted memory, and knows how
to deal with it?

That way, the firmware can accept all the memory on behalf of the OS
at ExitBootServices() time, unless the OS has indicated there is no
need to do so.
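
Something like the sketch below, perhaps, on the OS loader side.
Everything here is hypothetical, including the GUID, since no such
protocol exists in the UEFI spec today:

/*
 * Hypothetical protocol the OS loader installs before EBS() to say:
 * "I understand EFI_UNACCEPTED_MEMORY, don't accept on my behalf."
 */
static EFI_GUID gOsLazyAcceptCapableGuid = {
	0x12345678, 0x1234, 0x1234,
	{ 0x12, 0x34, 0x12, 0x34, 0x12, 0x34, 0x12, 0x34 }
};

EFI_STATUS AdvertiseLazyAcceptSupport(EFI_HANDLE ImageHandle)
{
	/*
	 * A NULL interface is enough: the mere presence of the protocol
	 * is the signal. Firmware checks for it at ExitBootServices()
	 * time and skips the accept-everything fallback if it is found.
	 */
	return gBS->InstallProtocolInterface(&ImageHandle,
					     &gOsLazyAcceptCapableGuid,
					     EFI_NATIVE_INTERFACE, NULL);
}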




> Patch below does this:
>
> $ file arch/x86/boot/bzImage
> arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 5.19.0-rc3-00016-g2f6aa48e28d9-dirty (kas@box) #2300 SMP PREEMPT_DYNAMIC UNACCEPTED_MEMORY Mon Jun 27 14:23:04 , RO-rootFS, swap_dev 0XC, Normal VGA
>
> Note UNACCEPTED_MEMORY in the output.
>
> Probably we want to have info there on which flavour of unaccepted
> memory is supported (TDX/SNP/whatever). That is a bit more tricky.
>
> Any opinion?
>
> diff --git a/init/Makefile b/init/Makefile
> index d82623d7fc8e..6688ea43e6bf 100644
> --- a/init/Makefile
> +++ b/init/Makefile
> @@ -32,7 +32,7 @@ quiet_cmd_compile.h = CHK     $@
>         $(CONFIG_SHELL) $(srctree)/scripts/mkcompile_h $@       \
>         "$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT_BUILD)"      \
>         "$(CONFIG_PREEMPT_DYNAMIC)" "$(CONFIG_PREEMPT_RT)" \
> -       "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
> +       "$(CONFIG_UNACCEPTED_MEMORY)" "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
>
>  include/generated/compile.h: FORCE
>         $(call cmd,compile.h)
> diff --git a/scripts/mkcompile_h b/scripts/mkcompile_h
> index ca40a5258c87..efacfecad699 100755
> --- a/scripts/mkcompile_h
> +++ b/scripts/mkcompile_h
> @@ -7,8 +7,9 @@ SMP=$3
>  PREEMPT=$4
>  PREEMPT_DYNAMIC=$5
>  PREEMPT_RT=$6
> -CC_VERSION="$7"
> -LD=$8
> +UNACCEPTED_MEMORY=$7
> +CC_VERSION="$8"
> +LD=$9
>
>  # Do not expand names
>  set -f
> @@ -51,6 +52,10 @@ elif [ -n "$PREEMPT" ] ; then
>         CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"
>  fi
>
> +if [ -n "$UNACCEPTED_MEMORY" ] ; then
> +       CONFIG_FLAGS="$CONFIG_FLAGS UNACCEPTED_MEMORY"
> +fi
> +
>  # Truncate to maximum length
>  UTS_LEN=64
>  UTS_VERSION="$(echo $UTS_VERSION $CONFIG_FLAGS $TIMESTAMP | cut -b -$UTS_LEN)"
> --
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-27 11:54     ` Ard Biesheuvel
@ 2022-06-27 12:22       ` Kirill A. Shutemov
  2022-06-27 16:17         ` Peter Gonda
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-27 12:22 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	khalid.elmously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML

On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > UEFI Specification version 2.9 introduces the concept of memory
> > > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > > guest. Accepting happens via a protocol specific for the Virtual
> > > > Machine platform.
> > > >
> > > > Accepting memory is costly and it makes VMM allocate memory for the
> > > > accepted guest physical address range. It's better to postpone memory
> > > > acceptance until memory is needed. It lowers boot time and reduces
> > > > memory overhead.
> > > >
> > > > The kernel needs to know what memory has been accepted. Firmware
> > > > communicates this information via memory map: a new memory type --
> > > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > >
> > > > Range-based tracking works fine for firmware, but it gets bulky for
> > > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > > to table fragmentation, and there's only a limited number of entries
> > > > in the e820 table.
> > > >
> > > > Another option is to mark such memory as usable in e820 and track if the
> > > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > > > physical address space.
> > > >
> > > > In the worst-case scenario -- a huge hole in the middle of the
> > > > address space -- It needs 256MiB to handle 4PiB of the address
> > > > space.
> > > >
> > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > >
> > > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > > faster for 4G TDX VM and ~4x faster for 64G.
> > > >
> > > > TDX-specific code isolated from the core of unaccepted memory support. It
> > > > supposed to help to plug-in different implementation of unaccepted memory
> > > > such as SEV-SNP.
> > > >
> > > > The tree can be found here:
> > > >
> > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > >
> > > Hi Kirill,
> > >
> > > I have a couple of questions about this feature, mainly about how
> > > cloud customers can use it. Since this is a confidential compute
> > > feature, I assume a large number of the users of these patches will
> > > be cloud customers using TDX and SNP. One issue I see with these
> > > patches is how we as a cloud provider know whether a customer's Linux
> > > image supports this feature: if the image doesn't have these patches,
> > > UEFI needs to fully validate the memory; if the image does, we can
> > > use this new protocol. In GCE we supply our VMs with a version of the
> > > EDK2 FW and the customer has no input into which UEFI we run; as far
> > > as I can tell from the Azure SNP VM documentation, it is very
> > > similar. We need to somehow tell our UEFI in the VM what to do based
> > > on the image. The current way I can see to solve this issue would be
> > > to have our customers give us metadata about their VM's image, but
> > > this seems burdensome on our customers (I assume we'll have more
> > > features which both UEFI and kernel need to support in order to be
> > > turned on, like this one) and error-prone: if a customer incorrectly
> > > labels their image, it may fail to boot. Has there been any
> > > discussion about how to solve this? My naive thought was: what if
> > > UEFI and the kernel had some sort of feature negotiation? Maybe that
> > > could happen via an extension to ExitBootServices() or a UEFI runtime
> > > driver; I'm not sure what's best here, just some ideas.
> >
> > Just as an idea, we can put info into UTS_VERSION which can be read from
> > the built bzImage. We have info on SMP and preemption there already.
> >
> 
> Instead of hacking this into the binary, couldn't we define a protocol
> that the kernel will call from the EFI stub (before EBS()) to identify
> itself as an image that understands unaccepted memory, and knows how
> to deal with it?
> 
> That way, the firmware can accept all the memory on behalf of the OS
> at ExitBootServices() time, unless the OS has indicated there is no
> need to do so.

I agree it would be better. But I think it would require a change to the
EFI spec, no?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-27 12:22       ` Kirill A. Shutemov
@ 2022-06-27 16:17         ` Peter Gonda
  2022-06-27 16:33           ` Ard Biesheuvel
  0 siblings, 1 reply; 200+ messages in thread
From: Peter Gonda @ 2022-06-27 16:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML

On Mon, Jun 27, 2022 at 6:22 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> > On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > [...]
> > >
> > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > the built bzImage. We have info on SMP and preemption there already.
> > >
> >
> > Instead of hacking this into the binary, couldn't we define a protocol
> > that the kernel will call from the EFI stub (before EBS()) to identify
> > itself as an image that understands unaccepted memory, and knows how
> > to deal with it?
> >
> > That way, the firmware can accept all the memory on behalf of the OS
> > at ExitBootServices() time, unless the OS has indicated there is no
> > need to do so.
>
> I agree it would be better. But I think it would require a change to the
> EFI spec, no?

Could this somehow be amended onto the UEFI Specification version 2.9
change which added all of the unaccepted memory features?

>
> --
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-27 16:17         ` Peter Gonda
@ 2022-06-27 16:33           ` Ard Biesheuvel
  2022-06-27 22:38             ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Ard Biesheuvel @ 2022-06-27 16:33 UTC (permalink / raw)
  To: Peter Gonda
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML

On Mon, 27 Jun 2022 at 18:17, Peter Gonda <pgonda@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 6:22 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > [...]
> > > >
> > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > the built bzImage. We have info on SMP and preemption there already.
> > > >
> > >
> > > Instead of hacking this into the binary, couldn't we define a protocol
> > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > itself as an image that understands unaccepted memory, and knows how
> > > to deal with it?
> > >
> > > That way, the firmware can accept all the memory on behalf of the OS
> > > at ExitBootServices() time, unless the OS has indicated there is no
> > > need to do so.
> >
> > I agree it would be better. But I think it would require a change to the
> > EFI spec, no?
>
> Could this somehow be amended onto the UEFI Specification version 2.9
> change which added all of the unaccepted memory features?
>

Why would this need a change in the EFI spec? Not every EFI protocol
needs to be in the spec.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-27 16:33           ` Ard Biesheuvel
@ 2022-06-27 22:38             ` Kirill A. Shutemov
  2022-06-28 17:17               ` Ard Biesheuvel
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-27 22:38 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML

On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > >
> > > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > > the built bzImage. We have info on SMP and preemption there already.
> > > > >
> > > >
> > > > Instead of hacking this into the binary, couldn't we define a protocol
> > > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > > itself as an image that understands unaccepted memory, and knows how
> > > > to deal with it?
> > > >
> > > > That way, the firmware can accept all the memory on behalf of the OS
> > > > at ExitBootServices() time, unless the OS has indicated there is no
> > > > need to do so.
> > >
> > > I agree it would be better. But I think it would require a change to
> > > the EFI spec, no?
> >
> > Could this somehow be amended onto the UEFI Specification version 2.9
> > change which added all of the unaccepted memory features?
> >
> 
> Why would this need a change in the EFI spec? Not every EFI protocol
> needs to be in the spec.

My EFI knowledge is shallow. Do we do this in other cases?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-27 22:38             ` Kirill A. Shutemov
@ 2022-06-28 17:17               ` Ard Biesheuvel
  2022-07-18 17:21                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Ard Biesheuvel @ 2022-06-28 17:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML

On Tue, 28 Jun 2022 at 00:38, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > > >
> > > > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > > > the built bzImage. We have info on SMP and preemption there already.
> > > > > >
> > > > >
> > > > > Instead of hacking this into the binary, couldn't we define a protocol
> > > > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > > > itself as an image that understands unaccepted memory, and knows how
> > > > > to deal with it?
> > > > >
> > > > > That way, the firmware can accept all the memory on behalf of the OS
> > > > > at ExitBootServices() time, unless the OS has indicated there is no
> > > > > need to do so.
> > > >
> > > > I agree it would be better. But I think it would require a change to
> > > > the EFI spec, no?
> > >
> > > Could this somehow be amended onto the UEFI Specification version 2.9
> > > change which added all of the unaccepted memory features?
> > >
> >
> > Why would this need a change in the EFI spec? Not every EFI protocol
> > needs to be in the spec.
>
> My EFI knowledge is shallow. Do we do this in other cases?
>

The E in EFI means 'extensible' and the whole design of a protocol
database using GUIDs as identifiers (which will not collide and
therefore need no a priori coordination when defining them) is
intended to allow extensions to be defined and implemented in a
distributed manner.

Of course, it would be fantastic if we can converge on a protocol that
all flavors of confidential compute can use, across different OSes, so
it is generally good if a protocol is defined in *some* shared
specification. But this doesn't have to be the EFI spec.
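
To make that concrete, a minimal sketch of the stub side -- the GUID and
the protocol layout here are invented for illustration, nothing like this
is defined anywhere yet:

/*
 * Hypothetical protocol: published by the firmware, located by the
 * EFI stub before ExitBootServices(). Calling it tells the firmware
 * that this kernel will accept memory itself, so the firmware can
 * skip any accept-everything fallback at EBS() time.
 */
typedef struct {
	efi_status_t (*allow_unaccepted_memory)(void);
} efi_unaccepted_memory_protocol_t;

/* Placeholder GUID -- a real one would be freshly generated. */
static const efi_guid_t unaccepted_memory_proto_guid =
	EFI_GUID(0x11223344, 0x5566, 0x7788,
		 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff, 0x00);

static void efi_signal_unaccepted_memory_support(void)
{
	efi_unaccepted_memory_protocol_t *proto;
	efi_status_t status;

	status = efi_bs_call(locate_protocol,
			     (efi_guid_t *)&unaccepted_memory_proto_guid,
			     NULL, (void **)&proto);

	/* Older firmware without the protocol: nothing to signal. */
	if (status == EFI_SUCCESS)
		proto->allow_unaccepted_memory();
}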

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-24  2:00       ` Kirill A. Shutemov
@ 2022-06-28 23:51         ` Kirill A. Shutemov
  2022-06-29  0:10           ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-28 23:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Hansen, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, kexec

On Fri, Jun 24, 2022 at 05:00:05AM +0300, Kirill A. Shutemov wrote:
> > If there is some deep and fundamental reason why this cannot be supported
> > then it probably makes sense to put some code in the arch_kexec_load
> > hook that verifies that deep and fundamental reason is present.
>
> Sounds straightforward. I can do this.

What about the patch below?

From 0b758600e1eef5525f2a46630ab3559f118a272a Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 10 May 2022 19:02:18 +0300
Subject: [PATCH] x86: Disable kexec if system has unaccepted memory

On kexec, the target kernel has to know what memory has been accepted.
The information in the EFI memory map is out of date and cannot be used.

boot_params.unaccepted_memory can be used to pass the bitmap between two
kernels on kexec, but the use-case is not yet implemented.

Disable kexec on machines with unaccepted memory for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/unaccepted_memory.c | 16 ++++++++++++++++
 include/linux/kexec.h           |  2 ++
 kernel/kexec.c                  |  4 ++++
 kernel/kexec_core.c             |  5 +++++
 kernel/kexec_file.c             |  4 ++++
 5 files changed, 31 insertions(+)

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 566c3a72aee8..529c3fd1dab3 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/kexec.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/pfn.h>
@@ -98,3 +99,18 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
 	return ret;
 }
+
+#ifdef CONFIG_KEXEC_CORE
+int arch_kexec_load(void)
+{
+	if (!boot_params.unaccepted_memory)
+		return 0;
+
+	/*
+	 * TODO: Information on memory acceptance status has to be communicated
+	 * between kernels.
+	 */
+	pr_warn_once("Disable kexec: not yet supported on systems with unaccepted memory\n");
+	return -EOPNOTSUPP;
+}
+#endif
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ce6536f1d269..dfd9493d0b4b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -396,6 +396,8 @@ void crash_free_reserved_phys_range(unsigned long begin, unsigned long end);
 void arch_kexec_protect_crashkres(void);
 void arch_kexec_unprotect_crashkres(void);
 
+int arch_kexec_load(void);
+
 #ifndef page_to_boot_pfn
 static inline unsigned long page_to_boot_pfn(struct page *page)
 {
diff --git a/kernel/kexec.c b/kernel/kexec.c
index b5e40f069768..352b3742f07a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -195,6 +195,10 @@ static inline int kexec_load_check(unsigned long nr_segments,
 {
 	int result;
 
+	result = arch_kexec_load();
+	if (result)
+		return result;
+
 	/* We only trust the superuser with rebooting the system. */
 	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
 		return -EPERM;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 4d34c78334ce..4d51b9271f6b 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1238,3 +1238,8 @@ void __weak arch_kexec_protect_crashkres(void)
 
 void __weak arch_kexec_unprotect_crashkres(void)
 {}
+
+int __weak arch_kexec_load(void)
+{
+	return 0;
+}
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 145321a5e798..d531df94ffbb 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -324,6 +324,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	int ret = 0, i;
 	struct kimage **dest_image, *image;
 
+	ret = arch_kexec_load();
+	if (ret)
+		return ret;
+
 	/* We only trust the superuser with rebooting the system. */
 	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
 		return -EPERM;
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-28 23:51         ` Kirill A. Shutemov
@ 2022-06-29  0:10           ` Dave Hansen
  2022-06-29  0:59             ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-06-29  0:10 UTC (permalink / raw)
  To: Kirill A. Shutemov, Eric W. Biederman
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	kexec

On 6/28/22 16:51, Kirill A. Shutemov wrote:
> On Fri, Jun 24, 2022 at 05:00:05AM +0300, Kirill A. Shutemov wrote:
>>> If there is some deep and fundamental reason why this cannot be supported
>>> then it probably makes sense to put some code in the arch_kexec_load
>>> hook that verifies that deep and fundamental reason is present.
...
> +	/*
> +	 * TODO: Information on memory acceptance status has to be communicated
> +	 * between kernels.
> +	 */

So, the deep and fundamental reason is... drum roll... you haven't
gotten around to implementing bitmap passing yet?!?!?   I have the
feeling that wasn't what Eric was looking for.


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-29  0:10           ` Dave Hansen
@ 2022-06-29  0:59             ` Kirill A. Shutemov
  2022-07-04  7:18               ` Dave Young
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-06-29  0:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, kexec

On Tue, Jun 28, 2022 at 05:10:56PM -0700, Dave Hansen wrote:
> On 6/28/22 16:51, Kirill A. Shutemov wrote:
> > On Fri, Jun 24, 2022 at 05:00:05AM +0300, Kirill A. Shutemov wrote:
> >>> If there is some deep and fundamental reason why this cannot be supported
> >>> then it probably makes sense to put some code in the arch_kexec_load
> >>> hook that verifies that deep and fundamental reason is present.
> ...
> > +	/*
> > +	 * TODO: Information on memory acceptance status has to be communicated
> > +	 * between kernels.
> > +	 */
> 
> So, the deep and fundamental reason is... drum roll... you haven't
> gotten around to implementing bitmap passing yet?!?!?   I have the
> feeling that wasn't what Eric was looking for.

The deep fundamental reason is that everything cannot be implemented and
upstreamed at once.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 11/14] x86: Disable kexec if system has unaccepted memory
  2022-06-29  0:59             ` Kirill A. Shutemov
@ 2022-07-04  7:18               ` Dave Young
  0 siblings, 0 replies; 200+ messages in thread
From: Dave Young @ 2022-07-04  7:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Eric W. Biederman, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	linux-mm, linux-coco, linux-efi,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	kexec

On Wed, 29 Jun 2022 at 08:59, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Tue, Jun 28, 2022 at 05:10:56PM -0700, Dave Hansen wrote:
> > On 6/28/22 16:51, Kirill A. Shutemov wrote:
> > > On Fri, Jun 24, 2022 at 05:00:05AM +0300, Kirill A. Shutemov wrote:
> > >>> If there is some deep and fundamental reason why this cannot be supported
> > >>> then it probably makes sense to put some code in the arch_kexec_load
> > >>> hook that verifies that deep and fundamental reason is present.
> > ...
> > > +   /*
> > > +    * TODO: Information on memory acceptance status has to be communicated
> > > +    * between kernels.
> > > +    */
> >
> > So, the deep and fundamental reason is... drum roll... you haven't
> > gotten around to implementing bitmap passing yet?!?!?   I have the
> > feeling that wasn't what Eric was looking for.
>
> The deep fundamental reason is that everything cannot be implemented and
> upstreamed at once.

If the only thing needed is to pass the bitmap to the kexec kernel: since
you have reserved the bitmap memory, I guess it is straightforward to set
the kexec boot_params.unaccepted_memory to the old value. Not sure if
there are problems when the decompression code accepts memory again,
though.
For in-kernel kexec_file_load, refer to setup_boot_parameters() in
arch/x86/kernel/kexec-bzimage64.c; for the kexec-tools kexec_load path,
refer to setup_linux_system_parameters() in
kexec/arch/i386/x86-linux-setup.c.
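
For the kexec_file_load side that would be roughly this (untested;
assuming the bitmap physical address is all that has to carry over):

	/* In setup_boot_parameters(): hand the first kernel's
	 * acceptance bitmap to the kexec'ed kernel, so ranges it
	 * already accepted are not treated as unaccepted again. */
	params->unaccepted_memory = boot_params.unaccepted_memory;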

Thanks
Dave


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support
  2022-06-15 15:05     ` Kirill A. Shutemov
@ 2022-07-17 17:16       ` Borislav Petkov
  0 siblings, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-17 17:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Zijlstra, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Wed, Jun 15, 2022 at 06:05:34PM +0300, Kirill A. Shutemov wrote:
> It also sounds painfully similar to uapi/ project. I'm not sure we want to
> go this path.

It is the same path the perf tool is taking - see the first paragraph:

https://lore.kernel.org/r/YtQM40VmiLTkPND2@kernel.org

I don't want to deal with the regular breakages or hacks to
boot/compressed/ just because of a little duplication. We copy the header
once and that's it - it doesn't even have to get updated like the perf
tool does from time to time.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-06-28 17:17               ` Ard Biesheuvel
@ 2022-07-18 17:21                 ` Kirill A. Shutemov
  2022-07-18 23:32                   ` Dionna Amalie Glaze
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-07-18 17:21 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML

On Tue, Jun 28, 2022 at 07:17:00PM +0200, Ard Biesheuvel wrote:
> On Tue, 28 Jun 2022 at 00:38, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > > > >
> > > > > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > > > > the built bzImage. We have info on SMP and preemption there already.
> > > > > > >
> > > > > >
> > > > > > Instead of hacking this into the binary, couldn't we define a protocol
> > > > > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > > > > itself as an image that understands unaccepted memory, and knows how
> > > > > > to deal with it?
> > > > > >
> > > > > > That way, the firmware can accept all the memory on behalf of the OS
> > > > > > at ExitBootServices() time, unless the OS has indicated there is no
> > > > > > need to do so.
> > > > >
> > > > > I agree it would be better. But I think it would require a change to
> > > > > the EFI spec, no?
> > > >
> > > > Could this somehow be amended onto the UEFI Specification version 2.9
> > > > change which added all of the unaccepted memory features?
> > > >
> > >
> > > Why would this need a change in the EFI spec? Not every EFI protocol
> > > needs to be in the spec.
> >
> > My EFI knowledge is shallow. Do we do this in other cases?
> >
> 
> The E in EFI means 'extensible' and the whole design of a protocol
> database using GUIDs as identifiers (which will not collide and
> therefore need no a priori coordination when defining them) is
> intended to allow extensions to be defined and implemented in a
> distributed manner.
> 
> Of course, it would be fantastic if we can converge on a protocol that
> all flavors of confidential compute can use, across different OSes, so
> it is generally good if a protocol is defined in *some* shared
> specification. But this doesn't have to be the EFI spec.

I've talked with our firmware expert today and I think we have a problem
with the approach where the kernel declares support of unaccepted memory.

This approach doesn't work if we include a bootloader in the picture: if
EBS() is called by the bootloader, we still cannot know whether the target
kernel supports unaccepted memory, and we are back to square one.

I think we should make it obvious from the kernel image itself whether it
supports unaccepted memory (via UTS_VERSION or some other way).

Any comments?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-18 17:21                 ` Kirill A. Shutemov
@ 2022-07-18 23:32                   ` Dionna Amalie Glaze
  2022-07-19  0:31                     ` Dionna Amalie Glaze
  2022-07-19  2:48                     ` Yao, Jiewen
  0 siblings, 2 replies; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-07-18 23:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

> I've talked with our firmware expert today and I think we have a problem
> with the approach where the kernel declares support of unaccepted memory.
>

Is this Jiewen Yao? I've been trying to design the UEFI spec change
with him. The bootloader problem he commented on this morning was
something I wasn't fully considering.

> This approach doesn't work if we include a bootloader in the picture: if
> EBS() is called by the bootloader, we still cannot know whether the target
> kernel supports unaccepted memory, and we are back to square one.
>
> I think we should make it obvious from the kernel image itself whether it
> supports unaccepted memory (via UTS_VERSION or some other way).
>
> Any comments?

Is this binary parsing trick already used in EDK2? If not, I wouldn't
want to introduce an ABI-solidifying requirement like that.

A more cumbersome, but more flexible, way to enable the feature is an
idea I had in a meeting today:
Make unaccepted memory support a feature-enabling EFI driver installed
to the EFI system partition.

* The first time you boot (setup mode), you install an EFI driver that
just sets a feature Pcd to true (using a custom protocol as Ard had
suggested above).
* The second time you boot, if the feature Pcd is true, then the UEFI
is free to not accept memory and use the unaccepted memory type. The
bootloader will run after unaccepted memory has been allowed already,
so there is no accept-all event.

The default behavior will be to accept all memory when GetMemoryMap is
called, unless the feature Pcd is set to true.

We can then say this driver isn't needed once some new generation of
this technology comes along and we can require unaccepted memory
support as part of that technology's baseline, or we manage to update
the UEFI spec to have GetMemoryMapEx which has unaccepted memory
support baked in and the bootloaders all know to use it.

The cloud experience will be, "is boot slow? Install this EFI driver
from the cloud service provider" to tell the UEFI to enable unaccepted
memory.
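
For what it's worth, the driver itself could be nearly trivial. A sketch
with a made-up PCD token (PcdSetBoolS is the real EDK2 PcdLib API; the
token name is not real):

#include <Uefi.h>
#include <Library/PcdLib.h>

/*
 * Hypothetical EDK2 driver: its only job is to flip a feature PCD so
 * the firmware knows the OS understands unaccepted memory and it may
 * leave memory unaccepted instead of accepting everything when
 * GetMemoryMap()/EBS() is reached.
 */
EFI_STATUS
EFIAPI
AllowUnacceptedMemoryEntry (
  IN EFI_HANDLE        ImageHandle,
  IN EFI_SYSTEM_TABLE  *SystemTable
  )
{
  return PcdSetBoolS (PcdUnacceptedMemorySupported, TRUE);
}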

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-18 23:32                   ` Dionna Amalie Glaze
@ 2022-07-19  0:31                     ` Dionna Amalie Glaze
  2022-07-19 18:29                       ` Dionna Amalie Glaze
  2022-07-19  2:48                     ` Yao, Jiewen
  1 sibling, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-07-19  0:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

> > > I think we should make it obvious from the kernel image itself whether it
> > > supports unaccepted memory (via UTS_VERSION or some other way).
> >

Something I didn't address in my previous email: how would the UEFI
know where the kernel is to parse this UTS_VERSION out when it's
booting a bootloader before Linux gets booted?

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* RE: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-18 23:32                   ` Dionna Amalie Glaze
  2022-07-19  0:31                     ` Dionna Amalie Glaze
@ 2022-07-19  2:48                     ` Yao, Jiewen
  1 sibling, 0 replies; 200+ messages in thread
From: Yao, Jiewen @ 2022-07-19  2:48 UTC (permalink / raw)
  To: Dionna Amalie Glaze, Kirill A. Shutemov
  Cc: Ard Biesheuvel, Peter Gonda, Borislav Petkov, Lutomirski, Andy,
	Christopherson,,
	Sean, Andrew Morton, Rodel, Jorg, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Hansen, Dave,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, Cox, Philip, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

Hey
I posted my comment on Bugzilla https://bugzilla.tianocore.org/show_bug.cgi?id=3987

Let's keep the EDKII/UEFI-related discussion there.

Thank you
Yao, Jiewen

> [...]

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19  0:31                     ` Dionna Amalie Glaze
@ 2022-07-19 18:29                       ` Dionna Amalie Glaze
  2022-07-19 19:13                         ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-07-19 18:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, Peter Gonda, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

> > > I think we should make it obvious from the kernel image itself whether it
> > > supports unaccepted memory (via UTS_VERSION or some other way).
> > >
>
> Something I didn't address in my previous email: how would the UEFI
> know where the kernel is to parse this UTS_VERSION out when it's
> booting a bootloader before Linux gets booted?
>

How about instead of the limited resource of UTS_VERSION, we add a
SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
be easier to parse out and more extensible in the future.
https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot

This can contain a bitmap of a number of features that we currently
need manual tagging for, such as SEV guest support, SEV-SNP guest
support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
(CONFIG_UNACCEPTED_MEMORY, SEV-SNP).
The VMM, UEFI, or boot loader can read these from the images/kernels
and have the appropriate behavior.
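
Hypothetically, an entry along these lines -- neither the setup_data
type nor the bits below exist today; the values are made up:

#define SETUP_BOOT_FEATURES		0x100	/* made-up type value */

#define BOOT_FEATURE_SEV_GUEST		(1ULL << 0)
#define BOOT_FEATURE_SEV_SNP_GUEST	(1ULL << 1)
#define BOOT_FEATURE_TDX_GUEST		(1ULL << 2)
#define BOOT_FEATURE_UNACCEPTED_MEMORY	(1ULL << 3)

/*
 * Carried as a struct setup_data entry (.type = SETUP_BOOT_FEATURES,
 * .len = 8) whose payload is the __u64 feature bitmap, placed at a
 * fixed, discoverable spot in the image so a VMM, UEFI, or boot
 * loader can read it without booting the kernel.
 */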

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 18:29                       ` Dionna Amalie Glaze
@ 2022-07-19 19:13                         ` Borislav Petkov
  2022-07-19 20:45                           ` Ard Biesheuvel
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-19 19:13 UTC (permalink / raw)
  To: Dionna Amalie Glaze
  Cc: Kirill A. Shutemov, Ard Biesheuvel, Peter Gonda, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On Tue, Jul 19, 2022 at 11:29:32AM -0700, Dionna Amalie Glaze wrote:
> How about instead of the limited resource of UTS_VERSION, we add a
> SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
> be easier to parse out and more extensible in the future.
> https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot
> 
> This can contain a bitmap of a number of features that we currently
> need manual tagging for, such as SEV guest support, SEV-SNP guest
> support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
> (CONFIG_UNACCEPTED_MEMORY, SEV-SNP).
> The VMM, UEFI, or boot loader can read these from the images/kernels
> and have the appropriate behavior.

I think for stuff like that you want loadflags or xloadflags in the
setup header.
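
I.e., something along these lines, next to the existing XLF_* bits in
arch/x86/include/uapi/asm/bootparam.h (the bit position here is made up):

#define XLF_CAN_ACCEPT_UNACCEPTED_MEMORY	(1<<7)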

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 19:13                         ` Borislav Petkov
@ 2022-07-19 20:45                           ` Ard Biesheuvel
  2022-07-19 21:23                             ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Ard Biesheuvel @ 2022-07-19 20:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dionna Amalie Glaze, Kirill A. Shutemov, Peter Gonda,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Dave Hansen, Mike Rapoport, David Hildenbrand,
	Marcelo Cerri, tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On Tue, 19 Jul 2022 at 21:14, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 19, 2022 at 11:29:32AM -0700, Dionna Amalie Glaze wrote:
> > How about instead of the limited resource of UTS_VERSION, we add a
> > SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
> > be easier to parse out and more extensible in the future.
> > https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot
> >
> > This can contain a bitmap of a number of features that we currently
> > need manual tagging for, such as SEV guest support, SEV-SNP guest
> > support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
> > (CONFIG_UNACCEPTED_MEMORY, SEV-SNP).
> > The VMM, UEFI, or boot loader can read these from the images/kernels
> > and have the appropriate behavior.
>
> I think for stuff like that you want loadflags or xloadflags in the
> setup header.
>

Please, no. Let's not invent Linux/x86 specific hacks to infer whether
or not the kernel is capable of accepting memory when it is perfectly
capable of telling us directly. We will surely need something
analogous on other architectures in the future as well, so the setup
header is definitely not the right place for this.

The 'bootloader that calls EBS()' case does not apply to Linux, and
given that we are talking specifically about confidential computing
VMs here, we can afford to be normative and define something generic
that works well for us.

So let's define a way for the EFI stub to signal to the firmware
(before EBS()) that it will take control of accepting memory. The
'bootloader that calls EBS()' case can invent something along the
lines of what has been proposed in this thread to infer the
capabilities of the kernel (and decide what to signal to the
firmware). But we have no need for this additional complexity on
Linux.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 20:45                           ` Ard Biesheuvel
@ 2022-07-19 21:23                             ` Borislav Petkov
  2022-07-19 21:35                               ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-19 21:23 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Dionna Amalie Glaze, Kirill A. Shutemov, Peter Gonda,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Dave Hansen, Mike Rapoport, David Hildenbrand,
	Marcelo Cerri, tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On Tue, Jul 19, 2022 at 10:45:06PM +0200, Ard Biesheuvel wrote:
> So let's define a way for the EFI stub to signal to the firmware
> (before EBS()) that it will take control of accepting memory. The
> 'bootloader that calls EBS()' case can invent something along the
> lines of what has been proposed in this thread to infer the
> capabilities of the kernel (and decide what to signal to the
> firmware). But we have no need for this additional complexity on
> Linux.

To tell you the truth, I've been perusing this thread from the sidelines
and am wondering why this needs this special dance at all?

If EFI takes control of accepting memory, then when the guest kernel
boots, it'll find all memory accepted and not do anything.

If EFI doesn't accept memory, then the guest kernel will boot and do the
accepting itself.

So either I'm missing something or we're overengineering this for no
good reason...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 21:23                             ` Borislav Petkov
@ 2022-07-19 21:35                               ` Dave Hansen
  2022-07-19 21:50                                 ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-07-19 21:35 UTC (permalink / raw)
  To: Borislav Petkov, Ard Biesheuvel
  Cc: Dionna Amalie Glaze, Kirill A. Shutemov, Peter Gonda,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On 7/19/22 14:23, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 10:45:06PM +0200, Ard Biesheuvel wrote:
>> So let's define a way for the EFI stub to signal to the firmware
>> (before EBS()) that it will take control of accepting memory. The
>> 'bootloader that calls EBS()' case can invent something along the
>> lines of what has been proposed in this thread to infer the
>> capabilities of the kernel (and decide what to signal to the
>> firmware). But we have no need for this additional complexity on
>> Linux.
> To tell you the truth, I've been perusing this thread from the sidelines
> and am wondering why this needs this special dance at all?
> 
> If EFI takes control of accepting memory, then when the guest kernel
> boots, it'll find all memory accepted and not do anything.
> 
> If EFI doesn't accept memory, then the guest kernel will boot and do the
> accepting itself.
> 
> So either I'm missing something or we're overengineering this for no
> good reason...

They're trying to design something that can (forever) handle guests that
might not be able to accept memory.  It's based on the idea that
*something* needs to assume control and EFI doesn't have enough
information to assume control.

I wish we didn't need all this complexity, though.

There are three entities that can influence how much memory is accepted:

1. The host
2. The guest firmware
3. The guest kernel (or bootloader or something after the firmware)

This whole thread is about how #2 and #3 talk to each other and make
sure *someone* does it.

I kinda think we should just take the guest firmware out of the picture.
 There are only going to be a few versions of the kernel that can boot
under TDX (or SEV-SNP) and *can't* handle unaccepted memory.  It seems a
bit silly to design this whole interface for a few versions of the OS
that TDX folks tell me can't be used anyway.

I think we should just say if you want to run an OS that doesn't have
unaccepted memory support, you can either:

1. Deal with that at the host level configuration
2. Boot some intermediate thing like a bootloader that does acceptance
   before running the stupid^Wunenlightened OS
3. Live with the 4GB of pre-accepted memory you get with no OS work.

Yeah, this isn't convenient for some hosts.  But, really, this is
preferable to doing an EFI/OS dance until the end of time.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 21:35                               ` Dave Hansen
@ 2022-07-19 21:50                                 ` Borislav Petkov
  2022-07-19 22:01                                   ` Kirill A. Shutemov
  2022-07-19 22:02                                   ` Dave Hansen
  0 siblings, 2 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-19 21:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ard Biesheuvel, Dionna Amalie Glaze, Kirill A. Shutemov,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> They're trying to design something that can (forever) handle guests that
> might not be able to accept memory. 

Wait, what?

If you can't modify those guests to teach them to accept memory, how do
you add TDX or SNP guest support to them?

I.e., you need to modify the guests and then you can add memory
acceptance. Basically, your point below...

> It's based on the idea that *something* needs to assume control and
> EFI doesn't have enough information to assume control.
>
> I wish we didn't need all this complexity, though.
> 
> There are three entities that can influence how much memory is accepted:
> 
> 1. The host
> 2. The guest firmware
> 3. The guest kernel (or bootloader or something after the firmware)
> 
> This whole thread is about how #2 and #3 talk to each other and make
> sure *someone* does it.
> 
> I kinda think we should just take the guest firmware out of the picture.
>  There are only going to be a few versions of the kernel that can boot
> under TDX (or SEV-SNP) and *can't* handle unaccepted memory.  It seems a
> bit silly to design this whole interface for a few versions of the OS
> that TDX folks tell me can't be used anyway.
> 
> I think we should just say if you want to run an OS that doesn't have
> unaccepted memory support, you can either:
> 
> 1. Deal with that at the host level configuration
> 2. Boot some intermediate thing like a bootloader that does acceptance
>    before running the stupid^Wunenlightened OS
> 3. Live with the 4GB of pre-accepted memory you get with no OS work.
> 
> Yeah, this isn't convenient for some hosts.  But, really, this is
> preferable to doing an EFI/OS dance until the end of time.

Ack. Definitely.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 21:50                                 ` Borislav Petkov
@ 2022-07-19 22:01                                   ` Kirill A. Shutemov
  2022-07-19 22:02                                   ` Dave Hansen
  1 sibling, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-07-19 22:01 UTC (permalink / raw)
  To: Borislav Petkov, Peter Gonda
  Cc: Dave Hansen, Ard Biesheuvel, Dionna Amalie Glaze,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On Tue, Jul 19, 2022 at 11:50:57PM +0200, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> > They're trying to design something that can (forever) handle guests that
> > might not be able to accept memory. 
> 
> Wait, what?
> 
> If you can't modify those guests to teach them to accept memory, how do
> you add TDX or SNP guest support to them?
> 
> I.e., you need to modify the guests and then you can add memory
> acceptance. Basically, your point below...
> 
> > It's based on the idea that *something* needs to assume control and
> > EFI doesn't have enough information to assume control.
> >
> > I wish we didn't need all this complexity, though.
> > 
> > There are three entities that can influence how much memory is accepted:
> > 
> > 1. The host
> > 2. The guest firmware
> > 3. The guest kernel (or bootloader or something after the firmware)
> > 
> > This whole thread is about how #2 and #3 talk to each other and make
> > sure *someone* does it.
> > 
> > I kinda think we should just take the guest firmware out of the picture.
> >  There are only going to be a few versions of the kernel that can boot
> > under TDX (or SEV-SNP) and *can't* handle unaccepted memory.  It seems a
> > bit silly to design this whole interface for a few versions of the OS
> > that TDX folks tell me can't be used anyway.
> > 
> > I think we should just say if you want to run an OS that doesn't have
> > unaccepted memory support, you can either:
> > 
> > 1. Deal with that at the host level configuration
> > 2. Boot some intermediate thing like a bootloader that does acceptance
> >    before running the stupid^Wunenlightened OS
> > 3. Live with the 4GB of pre-accepted memory you get with no OS work.
> > 
> > Yeah, this isn't convenient for some hosts.  But, really, this is
> > preferable to doing an EFI/OS dance until the end of time.
> 
> Ack. Definitely.

I like it too as it is a no-code solution :P

Peter, I'm pretty sure unaccepted memory support hits upstream well before
TDX gets adopted widely in production. I think it is pretty reasonable to
deal with it on the host side in the meanwhile.

Any objections?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 21:50                                 ` Borislav Petkov
  2022-07-19 22:01                                   ` Kirill A. Shutemov
@ 2022-07-19 22:02                                   ` Dave Hansen
  2022-07-19 22:08                                     ` Tom Lendacky
  2022-07-20  0:26                                     ` Marc Orr
  1 sibling, 2 replies; 200+ messages in thread
From: Dave Hansen @ 2022-07-19 22:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ard Biesheuvel, Dionna Amalie Glaze, Kirill A. Shutemov,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On 7/19/22 14:50, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
>> They're trying to design something that can (forever) handle guests that
>> might not be able to accept memory. 
> Wait, what?
> 
> If you can't modify those guests to teach them to accept memory, how do
> you add TDX or SNP guest support to them?

Mainline today, for instance, doesn't have unaccepted memory support for
TDX or SEV-SNP guests.  But, they both still boot fine because folks
either configure it on the host side not to *have* any unaccepted
memory.  Or, they just live with the small (4GB??) amount of
pre-accepted memory, which is fine for testing things.


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 22:02                                   ` Dave Hansen
@ 2022-07-19 22:08                                     ` Tom Lendacky
  2022-07-20  0:26                                     ` Marc Orr
  1 sibling, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-07-19 22:08 UTC (permalink / raw)
  To: Dave Hansen, Borislav Petkov
  Cc: Ard Biesheuvel, Dionna Amalie Glaze, Kirill A. Shutemov,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On 7/19/22 17:02, Dave Hansen wrote:
> On 7/19/22 14:50, Borislav Petkov wrote:
>> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
>>> They're trying to design something that can (forever) handle guests that
>>> might not be able to accept memory.
>> Wait, what?
>>
>> If you can't modify those guests to teach them to accept memory, how do
>> you add TDX or SNP guest support to them?
> 
> Mainline today, for instance, doesn't have unaccepted memory support for
> TDX or SEV-SNP guests.  But, they both still boot fine because folks
> either configure it on the host side not to *have* any unaccepted
> memory.  Or, they just live with the small (4GB??) amount of
> pre-accepted memory, which is fine for testing things.

Today, for SEV-SNP, OVMF accepts all of the memory in advance of booting 
the kernel.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-19 22:02                                   ` Dave Hansen
  2022-07-19 22:08                                     ` Tom Lendacky
@ 2022-07-20  0:26                                     ` Marc Orr
  2022-07-20  5:44                                       ` Borislav Petkov
  2022-07-21 17:12                                       ` Dave Hansen
  1 sibling, 2 replies; 200+ messages in thread
From: Marc Orr @ 2022-07-20  0:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Ard Biesheuvel, Dionna Amalie Glaze,
	Kirill A. Shutemov, Peter Gonda, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo Cerri, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On Tue, Jul 19, 2022 at 3:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 7/19/22 14:50, Borislav Petkov wrote:
> > On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> >> They're trying to design something that can (forever) handle guests that
> >> might not be able to accept memory.
> > Wait, what?
> >
> > If you can't modify those guests to teach them to accept memory, how do
> > you add TDX or SNP guest support to them?
>
> Mainline today, for instance, doesn't have unaccepted memory support for
> TDX or SEV-SNP guests.  But, they both still boot fine because folks
> either configure it on the host side not to *have* any unaccepted
> memory.  Or, they just live with the small (4GB??) amount of
> pre-accepted memory, which is fine for testing things.

For us (Google cloud), "1. Deal with that at the host level
configuration" looks like:
https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#guest-os-features

In other words, we have to tag images with "feature tags" to
distinguish which images have kernels that support which features.

Part of the reason we need to do it this way is that we use a single
guest firmware (i.e., guest UEFI) that lives outside of the image.

These feature tags are a mess to keep track of.

All that being said, I can totally see the upstream perspective being
"not our problem". It's hard to argue with that :-).

A few more thoughts:

- If the guest-side patches weren't upstream before this patch set to
handle unaccepted memory, you're all definitely right that this isn't
a real issue. (Maybe it still isn't...)
- Do we anticipate (many) more features for confidential compute in
the future that require code in both the guest FW and guest kernel? If
yes, then designing a FW-kernel feature negotiation could be useful
beyond this situation.
- Dave's suggestion to "2. Boot some intermediate thing like a
bootloader that does acceptance ..." is pretty clever! So if upstream
thinks this FW-kernel negotiation is not a good direction, maybe we
(Google) can pursue this idea to avoid introducing yet another tag on
our images.

Thank you all for this discussion.

Thanks,
Marc

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-20  0:26                                     ` Marc Orr
@ 2022-07-20  5:44                                       ` Borislav Petkov
  2022-07-20 17:03                                         ` Marc Orr
  2022-07-21 17:12                                       ` Dave Hansen
  1 sibling, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-20  5:44 UTC (permalink / raw)
  To: Marc Orr
  Cc: Dave Hansen, Ard Biesheuvel, Dionna Amalie Glaze,
	Kirill A. Shutemov, Peter Gonda, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo Cerri, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On Tue, Jul 19, 2022 at 05:26:21PM -0700, Marc Orr wrote:
> These feature tags are a mess to keep track of.

Well, looking at those tags, it doesn't look like you'll stop using them
anytime soon.

And once all the required SNP/TDX features - including unaccepted memory -
are part of the guest image, you'll have fewer tags, if anything.

:-)

> - Do we anticipate (many) more features for confidential compute in
> the future that require code in both the guest FW and guest kernel? If
> yes, then designing a FW-kernel feature negotiation could be useful
> beyond this situation.

Good question.

> - Dave's suggestion to "2. Boot some intermediate thing like a
> bootloader that does acceptance ..." is pretty clever! So if upstream
> thinks this FW-kernel negotiation is not a good direction, maybe we
> (Google) can pursue this idea to avoid introducing yet another tag on
> our images.

Are those tags really so nasty that you guys are looking at
upstream changes just to avoid them?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-20  5:44                                       ` Borislav Petkov
@ 2022-07-20 17:03                                         ` Marc Orr
  2022-07-22 15:07                                           ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Marc Orr @ 2022-07-20 17:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Ard Biesheuvel, Dionna Amalie Glaze,
	Kirill A. Shutemov, Peter Gonda, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo Cerri, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On Tue, Jul 19, 2022 at 10:44 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 19, 2022 at 05:26:21PM -0700, Marc Orr wrote:
> > These feature tags are a mess to keep track of.
>
> Well, looking at those tags, it doesn't look like you'll stop using them
> anytime soon.
>
> And once all the required SNP/TDX features - including unaccepted memory -
> are part of the guest image, you'll have fewer tags, if anything.
>
> :-)

Yeah, once all of the features are a part of the guest image AND any
older images with SNP/TDX minus the features are deprecated. I agree.

> > - Do we anticipate (many) more features for confidential compute in
> > the future that require code in both the guest FW and guest kernel? If
> > yes, then designing a FW-kernel feature negotiation could be useful
> > beyond this situation.
>
> Good question.
>
> > - Dave's suggestion to "2. Boot some intermediate thing like a
> > bootloader that does acceptance ..." is pretty clever! So if upstream
> > thinks this FW-kernel negotiation is not a good direction, maybe we
> > (Google) can pursue this idea to avoid introducing yet another tag on
> > our images.
>
> Are those tags really so nasty that you guys are looking at
> upstream changes just to avoid them?

Generally, no. But the problem with tags is that distros tag their
images wrong sometimes. And that leads to problems. For example, I
just got a bug assigned to me yesterday about some ARM image tagged as
SEV_CAPABLE. Oops. Lol :-). (Though, I'm pretty sure we won't try to
boot an ARM image on a non-ARM host anyway; but it's still wrong...)

That being said, this lazy accept problem is sort of a special case,
since it requires deploying code to the guest FW and the guest kernel.
I'm still relatively new at all of this, but other than the
SNP/TDX-enlightenment patches themselves,  I haven't really seen any
other examples of this. So that goes back to my previous question. Is
this going to happen a lot more? If not, I can definitely see value in
the argument to skip the complexity of the FW/kernel feature
negotiation.

Another thing I thought of since my last reply, mostly an internal
solution to this problem on our side: going back to Dave's 10k-foot
view of the different angles of how to solve this. For "1.
Deal with that at the host level configuration", I'm thinking we could
tag the images with their internal guest kernel version. For example,
if an image has a 5.15 kernel, then we could have a `KERNEL_5_15` tag.
This would then allow us to have logic in the guest FW like:

if (guest_kernel_is_at_least(/*major=*/5, /*minor=*/15))
     enable_lazy_accept = true;

One detail I actually missed in all of this is how the guest image
tag gets propagated into the guest FW in this approach. (Apologies for
this, as that's a pretty big oversight on my part.) Dionna: Have you
thought about this? Presumably this requires some sort of paravirt for
the guest to ask the host. And for any paravirt interface, now we need
to think about if it degrades the security of the confidential VMs.
Though, using it to get the kernel version to decide whether or not to
accept the memory within the guest UEFI or mark it as unaccepted seems
fine from a security angle to me.

Also, tagging images with their underlying kernel versions still seems
susceptible to mis-labeling. But this seems like it can be mostly
"fixed" via automation (e.g., write a tool to boot the guest and ask
it what its kernel version is and use the result to attach the tag).
Also, tagging the images with their kernel version seems like a much
more general solution to these sorts of issues.

Thoughts?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
  2022-06-14 12:57   ` Gupta, Pankaj
  2022-06-17 19:28   ` Tom Lendacky
@ 2022-07-21 15:14   ` Borislav Petkov
  2022-07-21 15:49     ` Dave Hansen
  2022-08-05 11:49   ` Vlastimil Babka
  3 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-21 15:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On Tue, Jun 14, 2022 at 03:02:19PM +0300, Kirill A. Shutemov wrote:
>     On-demand memory accept means latency spikes every time kernel steps
>     onto a new memory block. The spikes will go away once workload data
>     set size gets stabilized or all memory gets accepted.

What does that mean?

If we're accepting 2M pages and considering referential locality, how
are those "spikes" even noticeable?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-21 15:14   ` Borislav Petkov
@ 2022-07-21 15:49     ` Dave Hansen
  2022-07-22 19:18       ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-07-21 15:49 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On 7/21/22 08:14, Borislav Petkov wrote:
> On Tue, Jun 14, 2022 at 03:02:19PM +0300, Kirill A. Shutemov wrote:
>>     On-demand memory accept means latency spikes every time kernel steps
>>     onto a new memory block. The spikes will go away once workload data
>>     set size gets stabilized or all memory gets accepted.
> What does that mean?
> 
> If we're accepting 2M pages and considering referential locality, how
> are those "spikes" even noticeable?

Acceptance is slow and the heavy lifting is done inside the TDX module.
 It involves flushing old aliases out of the caches and initializing the
memory integrity metadata for every cacheline.  This implementation does
acceptance in 2MB chunks while holding a global lock.

So, those (effective) 2MB clflush+memset's (plus a few thousand cycles
for the hypercall/tdcall transitions) can't happen in parallel.  They
are serialized and must wait on each other.  If you have a few hundred
CPUs all trying to allocate memory (say, doing the first kernel compile
after a reboot), this is going to be very, very painful for a while.

That said, I think this is the right place to _start_.  There is going
to need to be some kind of follow-on solution (likely background
acceptance of some kind).  But, even with that solution, *this* code is
still needed to handle the degenerate case where the background accepter
can't keep up with foreground memory needs.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-20  0:26                                     ` Marc Orr
  2022-07-20  5:44                                       ` Borislav Petkov
@ 2022-07-21 17:12                                       ` Dave Hansen
  2022-07-23 11:14                                         ` Ard Biesheuvel
  1 sibling, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-07-21 17:12 UTC (permalink / raw)
  To: Marc Orr
  Cc: Borislav Petkov, Ard Biesheuvel, Dionna Amalie Glaze,
	Kirill A. Shutemov, Peter Gonda, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo Cerri, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On 7/19/22 17:26, Marc Orr wrote:
> - Dave's suggestion to "2. Boot some intermediate thing like a
> bootloader that does acceptance ..." is pretty clever! So if upstream
> thinks this FW-kernel negotiation is not a good direction, maybe we
> (Google) can pursue this idea to avoid introducing yet another tag on
> our images.

I'm obviously speaking only for myself here and not for "upstream" as a
whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
permanent pain in our necks to solve a very temporary problem.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-20 17:03                                         ` Marc Orr
@ 2022-07-22 15:07                                           ` Borislav Petkov
  0 siblings, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-22 15:07 UTC (permalink / raw)
  To: Marc Orr
  Cc: Dave Hansen, Ard Biesheuvel, Dionna Amalie Glaze,
	Kirill A. Shutemov, Peter Gonda, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo Cerri, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On Wed, Jul 20, 2022 at 10:03:40AM -0700, Marc Orr wrote:
> Generally, no. But the problem with tags is that distros tag their
> images wrong sometimes. And that leads to problems. For example, I
> just got a bug assigned to me yesterday about some ARM image tagged as
> SEV_CAPABLE. Oops. Lol :-). (Though, I'm pretty sure we won't try to
> boot an ARM image on a non-ARM host anyway; but it's still wrong...)

Yeah, even if, let it crash'n'burn - people will notice pretty quickly.

> That being said, this lazy accept problem is sort of a special case,
> since it requires deploying code to the guest FW and the guest kernel.
> I'm still relatively new at all of this, but other than the
> SNP/TDX-enlightenment patches themselves,  I haven't really seen any
> other examples of this. So that goes back to my previous question. Is
> this going to happen a lot more?

Good question.

Unfortunately, not even the architects of coco could give you an answer
because, as you see yourself, those additional features like memory
acceptance, live migration, etc keep changing - the whole coco thing is
pretty much a moving target.

For example, if someone comes along and says, err, see, I have this live
migration helper and that thing runs as an EFI executable and it is so
much better...

Not saying it'll happen but it could. I hope you're catching my drift.

> If not, I can definitely see value in the argument to skip the
> complexity of the FW/kernel feature negotiation.
>
> Another thing I thought of since my last reply, that's mostly an
> internal solution to this problem on our side: Going back to Dave's
> 10k-foot view of the different angles of how to solve this. For "1.
> Deal with that at the host level configuration", I'm thinking we could
> tag the images with their internal guest kernel version. For example,
> if an image has a 5.15 kernel, then we could have a `KERNEL_5_15` tag.
> This would then allow us to have logic in the guest FW like:
> 
> if (guest_kernel_is_at_least(/*major=*/5, /*minor=*/15))
>      enable_lazy_accept = true;

Well, I don't want to spoil your idea but imagine distros like SLE or
others backport features into old kernels. All of a sudden 5.14 or older
can do memory acceptance too. And then that version-based scheme falls
apart.

So I'm guessing it would probably be better to explicitly tag distro
images. Thing is, once all needed support gets in, you can drop the tags
and simply say, you don't support those old images anymore and assume
all required support is there and implicit...

> Also, tagging images with their underlying kernel versions still seems
> susceptible to mis-labeling. But this seems like it can be mostly
> "fixed" via automation (e.g., write a tool to boot the guest and ask
> it what its kernel version is and use the result to attach the tag).

I'll do you one better: boot the image and check for all required
features and produce tags. Or do not accept the image as a possible coco
image. And so on.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-21 15:49     ` Dave Hansen
@ 2022-07-22 19:18       ` Borislav Petkov
  2022-07-22 19:30         ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-22 19:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On Thu, Jul 21, 2022 at 08:49:31AM -0700, Dave Hansen wrote:
> Acceptance is slow and the heavy lifting is done inside the TDX module.
>  It involves flushing old aliases out of the caches and initializing the
> memory integrity metadata for every cacheline.  This implementation does
> acceptance in 2MB chunks while holding a global lock.

Oh, fun.

> So, those (effective) 2MB clflush+memset's (plus a few thousand cycles
> for the hypercall/tdcall transitions)

So this sounds strange - page validation on AMD - judging by the
pseudocode of the PVALIDATE insn - does a bunch of sanity checks on the
gVA of the page and then installs it into the RMP and also "PVALIDATE
performs the same segmentation and paging checks as a 1-byte read.
PVALIDATE does not invalidate TLB caches."

But that still sounds a lot less work than what the TDX module needs to
do...

> can't happen in parallel. They are serialized and must wait on each
> other.

Ofc, the Intel version of the RMP table needs to be protected. :-)

> If you have a few hundred CPUs all trying to allocate memory (say,
> doing the first kernel compile after a reboot), this is going to be
> very, very painful for a while.
>
> That said, I think this is the right place to _start_. There is going
> to need to be some kind of follow-on solution (likely background
> acceptance of some kind). But, even with that solution, *this* code
> is still needed to handle the degenerate case where the background
> accepter can't keep up with foreground memory needs.

I'm still leaning toward the view that it should be a two-tier thing: you
validate during boot a certain amount - say 4G - a size for which the
boot delay is acceptable and you do the rest on-demand along with a
background accepter.

That should give you the best of both worlds...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-22 19:18       ` Borislav Petkov
@ 2022-07-22 19:30         ` Dave Hansen
  2022-07-25 12:23           ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-07-22 19:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On 7/22/22 12:18, Borislav Petkov wrote:
> On Thu, Jul 21, 2022 at 08:49:31AM -0700, Dave Hansen wrote:
>> So, those (effective) 2MB clflush+memset's (plus a few thousand cycles
>> for the hypercall/tdcall transitions)
> 
> So this sounds strange - page validation on AMD - judging by the
> pseudocode of the PVALIDATE insn - does a bunch of sanity checks on the
> gVA of the page and then installs it into the RMP and also "PVALIDATE
> performs the same segmentation and paging checks as a 1-byte read.
> PVALIDATE does not invalidate TLB caches."
> 
> But that still sounds a lot less work than what the TDX module needs to
> do...

Sure does...  *Something* has to manage the cache coherency so that old
physical aliases of the converted memory don't write back and clobber
new data.  But, maybe the hardware is doing that now.


>> If you have a few hundred CPUs all trying to allocate memory (say,
>> doing the first kernel compile after a reboot), this is going to be
>> very, very painful for a while.
>>
>> That said, I think this is the right place to _start_. There is going
>> to need to be some kind of follow-on solution (likely background
>> acceptance of some kind). But, even with that solution, *this* code
>> is still needed to handle the degenerate case where the background
>> accepter can't keep up with foreground memory needs.
> 
> I'm still leaning toward the view that it should be a two-tier thing: you
> validate during boot a certain amount - say 4G - a size for which the
> boot delay is acceptable and you do the rest on-demand along with a
> background accepter.
> 
> That should give you the best of both worlds...

Yeah, that two-tier system is the way it's happening today from what I
understand.  This whole conversation is about how to handle the >4GB memory.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-21 17:12                                       ` Dave Hansen
@ 2022-07-23 11:14                                         ` Ard Biesheuvel
  2022-07-28 22:01                                           ` Dionna Amalie Glaze
  2022-08-09 11:14                                           ` Kirill A. Shutemov
  0 siblings, 2 replies; 200+ messages in thread
From: Ard Biesheuvel @ 2022-07-23 11:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Marc Orr, Borislav Petkov, Dionna Amalie Glaze,
	Kirill A. Shutemov, Peter Gonda, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Marcelo Cerri, tim.gardner, Khalid ElMously,
	philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML, Yao,
	Jiewen

On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 7/19/22 17:26, Marc Orr wrote:
> > - Dave's suggestion to "2. Boot some intermediate thing like a
> > bootloader that does acceptance ..." is pretty clever! So if upstream
> > thinks this FW-kernel negotiation is not a good direction, maybe we
> > (Google) can pursue this idea to avoid introducing yet another tag on
> > our images.
>
> I'm obviously speaking only for myself here and not for "upstream" as a
> whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> permanent pain in our necks to solve a very temporary problem.

EFI is basically our existing embodiment of this fw/kernel negotiation
thing, and iff we need it, I have no objection to using it for this
purpose, i.e., to allow the firmware to infer whether or not it should
accept all available memory on behalf of the OS before exiting boot
services. But if we don't need this, even better.

What I strongly object to is inventing a new bespoke way for the
firmware to make inferences about the capabilities of the image by
inspecting fields in the file representation of the image (which is
not guaranteed by EFI to be identical to its in-memory representation,
as, e.g., the PE/COFF header could be omitted by a loader without
violating the spec).

As for the intermediate thing: yes, that would be a valuable thing to
have in OVMF (and I will gladly take EDK2 patches that implement
this). However, I'm not sure how you decide whether or not this thing
should be active; doesn't that just move the problem around?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-22 19:30         ` Dave Hansen
@ 2022-07-25 12:23           ` Borislav Petkov
  2022-07-25 12:38             ` David Hildenbrand
  2022-07-25 13:00             ` Mike Rapoport
  0 siblings, 2 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-25 12:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote:
> Sure does...  *Something* has to manage the cache coherency so that old
> physical aliases of the converted memory don't write back and clobber
> new data.  But, maybe the hardware is doing that now.

Let's hope.

> Yeah, that two-tier system is the way it's happening today from what
> I understand. This whole conversation is about how to handle the >4GB
> memory.

Would it be possible to pre-accept a bunch of mem - think "pre-fault" -
from userspace?

I.e., I'm thinking some huge process is going to start in the VM, VM
userspace goes and causes a chunk of memory to be pre-accepted and then
the process starts and runs more-or-less smoothly as the majority of its
memory has already been "prepared".

Or does that not make any sense from mm perspective?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-25 12:23           ` Borislav Petkov
@ 2022-07-25 12:38             ` David Hildenbrand
  2022-07-25 12:53               ` Borislav Petkov
  2022-07-25 13:00             ` Mike Rapoport
  1 sibling, 1 reply; 200+ messages in thread
From: David Hildenbrand @ 2022-07-25 12:38 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 25.07.22 14:23, Borislav Petkov wrote:
> On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote:
>> Sure does...  *Something* has to manage the cache coherency so that old
>> physical aliases of the converted memory don't write back and clobber
>> new data.  But, maybe the hardware is doing that now.
> 
> Let's hope.
> 
>> Yeah, that two-tier system is the way it's happening today from what
>> I understand. This whole conversation is about how to handle the >4GB
>> memory.
> 
> Would it be possible to pre-accept a bunch of mem - think "pre-fault" -
> from userspace?
> 
> I.e., I'm thinking some huge process is going to start in the VM, VM
> userspace goes and causes a chunk of memory to be pre-accepted and then
> the process starts and runs more-or-less smoothly as the majority of its
> memory has already been "prepared".
> 
> Or does that not make any sense from mm perspective?
> 

The less core-MM code to handle unaccepted memory the better. Meaning
that any kind of additional pre-acceptance (in addition to what we have
here) needs good justification.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-25 12:38             ` David Hildenbrand
@ 2022-07-25 12:53               ` Borislav Petkov
  2022-07-26 14:30                 ` David Hildenbrand
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-25 12:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On Mon, Jul 25, 2022 at 02:38:57PM +0200, David Hildenbrand wrote:
> The less core-MM code to handle unaccepted memory the better. Meaning
> that any kind of additional pre-acceptance (in addition to what we have
> here) needs good justification.

I actually mean to have a user interface to have the core code
pre-accept. I.e., interface to what will be already there anyway.

Or is that frowned upon too?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-25 12:23           ` Borislav Petkov
  2022-07-25 12:38             ` David Hildenbrand
@ 2022-07-25 13:00             ` Mike Rapoport
  2022-07-25 13:05               ` Borislav Petkov
  1 sibling, 1 reply; 200+ messages in thread
From: Mike Rapoport @ 2022-07-25 13:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On Mon, Jul 25, 2022 at 02:23:20PM +0200, Borislav Petkov wrote:
> On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote:
> > Sure does...  *Something* has to manage the cache coherency so that old
> > physical aliases of the converted memory don't write back and clobber
> > new data.  But, maybe the hardware is doing that now.
> 
> Let's hope.
> 
> > Yeah, that two-tier system is the way it's happening today from what
> > I understand. This whole conversation is about how to handle the >4GB
> > memory.
> 
> Would it be possible to pre-accept a bunch of mem - think "pre-fault" -
> from userspace?
> 
> I.e., I'm thinking some huge process is going to start in the VM, VM
> userspace goes and causes a chunk of memory to be pre-accepted and then
> the process starts and runs more-or-less smoothly as the majority of its
> memory has already been "prepared".

An application in the VM can do mlock() or mmap(..., MAP_POPULATE, ...) and
this will essentially force acceptance of that memory.

But there's no sysctl or something for that.
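
E.g., a minimal userspace sketch (the function is hypothetical; any
anonymous mapping would do):

	#include <stddef.h>
	#include <sys/mman.h>

	/* Pre-fault (and thus pre-accept) a region before the real
	 * workload starts; mlock() after a plain mmap() works too. */
	static void *prefault(size_t size)
	{
		void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
			       -1, 0);
		return p == MAP_FAILED ? NULL : p;
	}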
 
> Or does that not make any sense from mm perspective?
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 04/14] efi/x86: Get full memory map in allocate_e820()
  2022-06-14 12:02 ` [PATCHv7 04/14] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
@ 2022-07-25 13:02   ` Borislav Petkov
  0 siblings, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-25 13:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:21PM +0300, Kirill A. Shutemov wrote:
> Currently allocate_e820() only interested in the size of map and size of
			   ^
			   is


> memory descriptor to determine how many e820 entries the kernel needs.
> 
> UEFI Specification version 2.9 introduces a new memory type --
> unaccepted memory. To track unaccepted memory kernel needs to allocate
> a bitmap. The size of the bitmap is dependent on the maximum physical
> address present in the system. A full memory map is required to find
> the maximum address.
> 
> Modify allocate_e820() to get a full memory map.
> 
> This is preparation for the next patch that implements handling of
> unaccepted memory in EFI stub.

As already pointed out, the concept of "next patch" is ambiguous in git.
Just drop the whole sentence.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/firmware/efi/libstub/x86-stub.c | 28 +++++++++++--------------
>  1 file changed, 12 insertions(+), 16 deletions(-)

With the above addressed:

Reviewed-by: Borislav Petkov <bp@suse.de>

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-25 13:00             ` Mike Rapoport
@ 2022-07-25 13:05               ` Borislav Petkov
  0 siblings, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-25 13:05 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On Mon, Jul 25, 2022 at 04:00:14PM +0300, Mike Rapoport wrote:
> An application in the VM can do mlock() or mmap(..., MAP_POPULATE, ...) and
> this will essentially force acceptance of that memory.

Ah, cool, that's what I meant.

> But there's no sysctl or something for that.

Yeah, no need.

I was simply wondering whether one can relocate the acceptance work to
the moment prior to starting the process so that it can run smoothly
once started and doesn't cause spikes due to on-demand acceptance.

At least not too many.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support
  2022-06-14 12:02 ` [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support Kirill A. Shutemov
  2022-06-15 10:19   ` Peter Zijlstra
@ 2022-07-25 21:33   ` Borislav Petkov
  1 sibling, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-25 21:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:22PM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/boot/compressed/compiler.h b/arch/x86/boot/compressed/compiler.h
> new file mode 100644
> index 000000000000..452c4c0844b9
> --- /dev/null
> +++ b/arch/x86/boot/compressed/compiler.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef BOOT_COMPILER_H
> +#define BOOT_COMPILER_H
> +#define __LINUX_COMPILER_H /* Inhibit inclusion of <linux/compiler.h> */
> +
> +# define likely(x)	__builtin_expect(!!(x), 1)
> +# define unlikely(x)	__builtin_expect(!!(x), 0)
> +
> +#endif

I guess that header is not really needed - simply drop the annotations
from the code too.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
  2022-06-22 19:58   ` Dave Hansen
@ 2022-07-26  8:35   ` Borislav Petkov
  1 sibling, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-26  8:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:23PM +0300, Kirill A. Shutemov wrote:
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 7aa4717cdcac..e1270beff4dc 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -305,6 +305,20 @@ config EFI_COCO_SECRET
>  	  virt/coco/efi_secret module to access the secrets, which in turn
>  	  allows userspace programs to access the injected secrets.
>  
> +config UNACCEPTED_MEMORY
> +	bool
> +	depends on EFI_STUB
> +	help
> +	   Some Virtual Machine platforms, such as Intel TDX, require
> +	   some memory to be "accepted" by the guest before it can be used.
> +	   This mechanism helps prevent malicious hosts from making changes
> +	   to guest memory.
> +
> +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> +	   This option adds support for unaccepted memory and makes such memory
> +	   usable by the kernel.
> +

This looks weird:

$ grep EFI_STUB .config
CONFIG_EFI_STUB=y
$ grep UNACCEPTED_MEMORY .config
$

So the bool needs to have a text string after it so that it is
selectable. Or how is UNACCEPTED_MEMORY supposed to be enabled otherwise?
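
For illustration, a prompt along these lines (the text is mine) would
make it user-selectable:

	config UNACCEPTED_MEMORY
		bool "Unaccepted memory support"
		depends on EFI_STUB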

If I add the string and select UNACCEPTED_MEMORY, it won't build:

mm/page_alloc.c: In function ‘accept_page’:
mm/page_alloc.c:1013:9: error: implicit declaration of function ‘accept_memory’ [-Werror=implicit-function-declaration]
 1013 |         accept_memory(start, start + (PAGE_SIZE << order));
      |         ^~~~~~~~~~~~~
mm/page_alloc.c: In function ‘page_contains_unaccepted’:
mm/page_alloc.c:1029:16: error: implicit declaration of function ‘range_contains_unaccepted_memory’; did you mean ‘page_contains_unaccepted’? [-Werror=implicit-function-declaration]
 1029 |         return range_contains_unaccepted_memory(start, end);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                page_contains_unaccepted
mm/memblock.c: In function ‘memblock_alloc_range_nid’:
mm/memblock.c:1415:9: error: implicit declaration of function ‘accept_memory’ [-Werror=implicit-function-declaration]
 1415 |         accept_memory(found, found + size);
      |         ^~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[1]: *** [scripts/Makefile.build:249: mm/memblock.o] Error 1
make[1]: *** Waiting for unfinished jobs....
cc1: some warnings being treated as errors
make[1]: *** [scripts/Makefile.build:249: mm/page_alloc.o] Error 1
make: *** [Makefile:1843: mm] Error 2
make: *** Waiting for unfinished jobs....

so this is weird.

> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index 504955368934..b91c89100b2d 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -15,6 +15,7 @@
>  #include <asm/setup.h>
>  #include <asm/desc.h>
>  #include <asm/boot.h>
> +#include <asm/unaccepted_memory.h>
>  
>  #include "efistub.h"
>  
> @@ -607,6 +608,17 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>  			e820_type = E820_TYPE_PMEM;
>  			break;
>  
> +		case EFI_UNACCEPTED_MEMORY:
> +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
> +				efi_warn_once("The system has unaccepted memory,"
> +					     " but kernel does not support it\n");
> +				efi_warn_once("Consider enabling CONFIG_UNACCEPTED_MEMORY\n");
> +				continue;
> +			}

So that it can be grepped for:

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index b91c89100b2d..8be6b675e08e 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -610,9 +610,8 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 
 		case EFI_UNACCEPTED_MEMORY:
 			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
-				efi_warn_once("The system has unaccepted memory,"
-					     " but kernel does not support it\n");
-				efi_warn_once("Consider enabling CONFIG_UNACCEPTED_MEMORY\n");
+				efi_warn_once(
+"The system has unaccepted memory, but kernel does not support it.\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
 				continue;
 			}
 			e820_type = E820_TYPE_RAM;


Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap
  2022-06-14 12:02 ` [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
@ 2022-07-26  9:07   ` Borislav Petkov
  2022-11-30  1:28     ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-26  9:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On Tue, Jun 14, 2022 at 03:02:25PM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index f267205f2d5a..22d1fe48dcba 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
>  	int i;
>  	u64 end;
>  
> +	/* Mark unaccepted memory bitmap reserved */
> +	if (boot_params.unaccepted_memory) {
> +		unsigned long size;
> +
> +		/* One bit per 2MB */
> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> +				    PMD_SIZE * BITS_PER_BYTE);
> +		memblock_reserve(boot_params.unaccepted_memory, size);
> +	}
> +

Hmm, I don't like how this is dropped right in the middle of an unrelated
function.

You're adding arch/x86/mm/unaccepted_memory.c later. Why don't you put
that chunk in a function there which is called by early_reserve_memory()
which does exactly what you want - reserve memory early, before memblock
allocations?

Hmm.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
  2022-06-23 17:19   ` Dave Hansen
@ 2022-07-26 10:21   ` Borislav Petkov
  2022-08-02 23:46     ` Dave Hansen
  2022-07-26 17:25   ` Borislav Petkov
  2022-07-26 20:17   ` Andy Lutomirski
  3 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-26 10:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:27PM +0300, Kirill A. Shutemov wrote:
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.

FTR, this random-memory-access-to-unaccepted-memory-is-deadly thing is
really silly. We should be able to handle such cases - because they do
happen often - in a more resilient way. Just look at the complex dance
this patch needs to do just to avoid this.

IOW, this part of the coco technology needs improvement.

Just sayin...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 13/14] x86/tdx: Refactor try_accept_one()
  2022-06-14 12:02 ` [PATCHv7 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
  2022-06-23 17:31   ` Dave Hansen
@ 2022-07-26 10:58   ` Borislav Petkov
  1 sibling, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-07-26 10:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:30PM +0300, Kirill A. Shutemov wrote:
> Rework try_accept_one() to return accepted size instead of modifying
> 'start' inside the helper. It makes 'start' in-only argumaent and

Unknown word [argumaent] in commit message.
Suggestions: ['argument', 'augment', 'arguments', 'armament', 'argent', "argument's"]

You need a spellchecker enabled in your editor with which you write your
commit messages.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-07-25 12:53               ` Borislav Petkov
@ 2022-07-26 14:30                 ` David Hildenbrand
  0 siblings, 0 replies; 200+ messages in thread
From: David Hildenbrand @ 2022-07-26 14:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On 25.07.22 14:53, Borislav Petkov wrote:
> On Mon, Jul 25, 2022 at 02:38:57PM +0200, David Hildenbrand wrote:
>> The less core-MM code to handle unaccepted memory the better. Meaning,
>> that any kind of additional pre-acceptance (in addition to what we have
>> here) needs good justification.
> 
> I actually mean to have a user interface to have the core code
> pre-accept. I.e., interface to what will be already there anyway.
>

Ah, got it. The "issue" with ordinary prefaulting in user space is that
you'll also end up zeroing the memory. But yeah, it should just accept a
bunch of memory.

> Or is that frowned upon too?

I was assuming you'd want some additional interface that only accepts
memory without actually temporarily allocating it.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 03/14] mm: Report unaccepted memory in meminfo
  2022-06-14 12:02 ` [PATCHv7 03/14] mm: Report unaccepted memory in meminfo Kirill A. Shutemov
@ 2022-07-26 14:33   ` David Hildenbrand
  0 siblings, 0 replies; 200+ messages in thread
From: David Hildenbrand @ 2022-07-26 14:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 14.06.22 14:02, Kirill A. Shutemov wrote:
> Track amount of unaccepted memory and report it in /proc/meminfo and in
> node meminfo.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/base/node.c    | 7 +++++++
>  fs/proc/meminfo.c      | 5 +++++
>  include/linux/mmzone.h | 1 +
>  mm/page_alloc.c        | 9 ++++++++-
>  mm/vmstat.c            | 1 +
>  5 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ac6376ef7a1..fc7cf4d91eb6 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -446,6 +446,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>  			     "Node %d ShmemPmdMapped: %8lu kB\n"
>  			     "Node %d FileHugePages: %8lu kB\n"
>  			     "Node %d FilePmdMapped: %8lu kB\n"
> +#endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +			     "Node %d UnacceptedPages: %8lu kB\n"

Nit: I'd just call that "Unaccepted", as the unit is kB.

>  #endif
>  			     ,
>  			     nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
> @@ -474,6 +477,10 @@ static ssize_t node_read_meminfo(struct device *dev,
>  			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
>  			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
>  			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
> +#endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +			     ,
> +			     nid, K(node_page_state(pgdat, NR_UNACCEPTED))
>  #endif
>  			    );
>  	len += hugetlb_report_node_meminfo(buf, len, nid);
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 6e89f0e2fd20..796544e50365 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -153,6 +153,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  		    global_zone_page_state(NR_FREE_CMA_PAGES));
>  #endif
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	show_val_kb(m, "UnacceptedPages:",
> +		    global_node_page_state(NR_UNACCEPTED));

Dito.

> +#endif
> +
>  	hugetlb_report_meminfo(m);
>  
>  	arch_report_meminfo(m);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aab70355d64f..aa08cd7eaaf5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -212,6 +212,7 @@ enum node_stat_item {
>  	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
>  	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
>  	NR_KERNEL_STACK_KB,	/* measured in KiB */
> +	NR_UNACCEPTED,

Do we want to ifdef that as well? (and in all other cases below)

>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>  	NR_KERNEL_SCS_KB,	/* measured in KiB */
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 279c2746aaa8..6316d695a567 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1012,6 +1012,7 @@ static void accept_page(struct page *page, unsigned int order)
>  
>  	accept_memory(start, start + (PAGE_SIZE << order));
>  	__ClearPageUnaccepted(page);
> +	mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, -(1 << order));
>  
>  	/* Assert that there is no PageUnaccepted() on tail pages */
>  	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> @@ -1063,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
>  	struct page *buddy;
>  	bool to_tail;
>  	bool page_needs_acceptance = false;
> +	int nr_unaccepted = 0;
>  
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1076,6 +1078,7 @@ static inline void __free_one_page(struct page *page,
>  
>  	if (PageUnaccepted(page)) {
>  		page_needs_acceptance = true;
> +		nr_unaccepted += 1 << order;
>  		__ClearPageUnaccepted(page);
>  	}
>  
> @@ -1117,6 +1120,7 @@ static inline void __free_one_page(struct page *page,
>  		/* Mark page unaccepted if any of merged pages were unaccepted */
>  		if (PageUnaccepted(buddy)) {
>  			page_needs_acceptance = true;
> +			nr_unaccepted += 1 << order;
>  			__ClearPageUnaccepted(buddy);
>  		}
>  
> @@ -1143,8 +1147,11 @@ static inline void __free_one_page(struct page *page,
>  	 */
>  	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
>  		page_needs_acceptance = page_contains_unaccepted(page, order);
> -	if (page_needs_acceptance)
> +	if (page_needs_acceptance) {
>  		__SetPageUnaccepted(page);
> +		__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED,
> +				    (1 << order) - nr_unaccepted);
> +	}
>  
>  	if (fpi_flags & FPI_TO_TAIL)
>  		to_tail = true;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 373d2730fcf2..4e12d22f1e04 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1236,6 +1236,7 @@ const char * const vmstat_text[] = {
>  	"nr_foll_pin_acquired",
>  	"nr_foll_pin_released",
>  	"nr_kernel_stack",
> +	"nr_unaccepted",
>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>  	"nr_shadow_call_stack",
>  #endif


LGTM

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 14/14] x86/tdx: Add unaccepted memory support
  2022-06-14 12:02 ` [PATCHv7 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
  2022-06-24 16:22   ` Dave Hansen
@ 2022-07-26 14:51   ` Borislav Petkov
  2022-08-09 11:45     ` Kirill A. Shutemov
  1 sibling, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-26 14:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:31PM +0300, Kirill A. Shutemov wrote:
> +static bool is_tdx_guest(void)
> +{
> +	static bool once;
> +	static bool is_tdx;
> +
> +	if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
> +		return false;
> +
> +	if (!once) {
> +		u32 eax, sig[3];
> +
> +		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
> +			    &sig[0], &sig[2],  &sig[1]);
> +		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
> +		once = true;
> +	}
> +
> +	return is_tdx;
> +}

early_tdx_detect() already calls this CPUID function. It assigns
function pointers too.

So why can't you assign an accept_memory() function pointer there and
get rid of this sprinkled if (tdx) everywhere?
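
A sketch of what that could look like (the ops struct is invented here,
mirroring how pio_ops is set up):

	/* boot stub, shared with mem.c */
	struct mem_ops {
		void (*accept_memory)(phys_addr_t start, phys_addr_t end);
	};
	struct mem_ops mem_ops;

	void early_tdx_detect(void)
	{
		u32 eax, sig[3];

		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
			    &sig[0], &sig[2], &sig[1]);
		if (memcmp(TDX_IDENT, sig, sizeof(sig)))
			return;

		/* existing pio_ops assignments go here */

		mem_ops.accept_memory = tdx_accept_memory;
	}

Then __accept_memory() checks mem_ops.accept_memory instead of doing the
CPUID dance every time.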

> diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
> index 918a7606f53c..8518a75e5dd5 100644
> --- a/arch/x86/boot/compressed/tdx.c
> +++ b/arch/x86/boot/compressed/tdx.c
> @@ -3,12 +3,15 @@
>  #include "../cpuflags.h"
>  #include "../string.h"
>  #include "../io.h"
> +#include "align.h"
>  #include "error.h"
> +#include "pgtable_types.h"
>  
>  #include <vdso/limits.h>
>  #include <uapi/asm/vmx.h>
>  
>  #include <asm/shared/tdx.h>
> +#include <asm/page_types.h>
>  
>  /* Called from __tdx_hypercall() for unrecoverable failure */
>  void __tdx_hypercall_failed(void)
> @@ -75,3 +78,78 @@ void early_tdx_detect(void)
>  	pio_ops.f_outb = tdx_outb;
>  	pio_ops.f_outw = tdx_outw;
>  }
> +
> +static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
> +				    enum pg_level level)

That's pretty much a copy of the same function in arch/x86/coco/tdx/tdx.c.

Yeah, you need a tdx-shared.c which you include in both places just like
it is done with sev-shared.c
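
For reference, the sev-shared.c arrangement is literally an #include of
the shared code from both places:

	/* arch/x86/boot/compressed/sev.c */
	#include "../../kernel/sev-shared.c"

A tdx-shared.c could be pulled in the same way from the boot stub and
from arch/x86/coco/tdx/tdx.c.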

...

> +void tdx_accept_memory(phys_addr_t start, phys_addr_t end)

That one too.

> +{
> +	/*
> +	 * Notify the VMM about page mapping conversion. More info about ABI
> +	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
> +	 * section "TDG.VP.VMCALL<MapGPA>"
> +	 */
> +	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
> +		error("Accepting memory failed\n");
> +
> +	/*
> +	 * For shared->private conversion, accept the page using
> +	 * TDX_ACCEPT_PAGE TDX module call.
> +	 */
> +	while (start < end) {
> +		unsigned long len = end - start;
> +		unsigned long accept_size;
> +
> +		/*
> +		 * Try larger accepts first. It gives chance to VMM to keep
> +		 * 1G/2M Secure EPT entries where possible and speeds up
> +		 * process by cutting number of hypercalls (if successful).
> +		 */
> +
> +		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
> +		if (!accept_size)
> +			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
> +		if (!accept_size)
> +			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
> +		if (!accept_size)
> +			error("Accepting memory failed\n");
> +		start += accept_size;

This series of calls to try_accept_one() appears in at least three
places. Please carve it out into a separate function and put it in
tdx-shared.c.
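
Something like this (the helper name is made up):

	static unsigned long try_accept_page(phys_addr_t start,
					     unsigned long len)
	{
		unsigned long accept_size;

		/* Largest page size first, then fall back to smaller ones */
		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
		if (!accept_size)
			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
		if (!accept_size)
			accept_size = try_accept_one(start, len, PG_LEVEL_4K);

		return accept_size;
	}

so each caller reduces to a single call plus its own error handling.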

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
  2022-06-23 17:19   ` Dave Hansen
  2022-07-26 10:21   ` Borislav Petkov
@ 2022-07-26 17:25   ` Borislav Petkov
  2022-07-26 17:46     ` Dave Hansen
  2022-07-26 20:17   ` Andy Lutomirski
  3 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-07-26 17:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jun 14, 2022 at 03:02:27PM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> index 1df918b21469..bcd56fe82b9e 100644
> --- a/arch/x86/mm/unaccepted_memory.c
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  	bitmap = __va(boot_params.unaccepted_memory);
>  	range_start = start / PMD_SIZE;
>  
> +	/*
> +	 * load_unaligned_zeropad() can lead to unwanted loads across page
> +	 * boundaries. The unwanted loads are typically harmless. But, they
> +	 * might be made to totally unrelated or even unmapped memory.
> +	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> +	 * #VE) to recover from these unwanted loads.
> +	 *
> +	 * But, this approach does not work for unaccepted memory. For TDX, a
> +	 * load from unaccepted memory will not lead to a recoverable exception
> +	 * within the guest. The guest will exit to the VMM where the only
> +	 * recourse is to terminate the guest.
> +	 *
> +	 * There are three parts to fix this issue and comprehensively avoid
> +	 * access to unaccepted memory. Together these ensure that an extra
> +	 * “guard” page is accepted in addition to the memory that needs to be
> +	 * used:
> +	 *
> +	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> +	 *
> +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
> +	 *    aligned on a 2M boundary.

Why do we need those unicode quotes and backticks in there?

verify_diff: Warning: Unicode char [“] (0x8220 in line: +	 * “guard” page is accepted in addition to the memory that needs to be
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    needs to be done to make ‘page’ usable. That work might include
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    accepting pages in addition to ‘page’ itself.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-07-26 17:25   ` Borislav Petkov
@ 2022-07-26 17:46     ` Dave Hansen
  0 siblings, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-07-26 17:46 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel

On 7/26/22 10:25, Borislav Petkov wrote:
> Why do we need those unicode quotes and backticks in there?
> 
> verify_diff: Warning: Unicode char [“] (0x8220 in line: +	 * “guard” page is accepted in addition to the memory that needs to be
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    needs to be done to make ‘page’ usable. That work might include
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    accepting pages in addition to ‘page’ itself.

I've been encouraging folks to stick their changelogs in a Google Doc
(or even Word) when they're writing them.  This gives them better
spelling and grammar checking than is available in most editors and also
makes it easier for folks to improve it collaboratively.  I find it a
lot more efficient than sending 10 copies back and forth in email.

The downside is that those fancy programs insert unicode willy nilly for
stuff like this.  You usually need to catch it with scripts because it's
hard to spot visually.
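
For example, something as simple as grep -nP '[^\x00-\x7F]' run over the
diff (illustrative only) will flag the offending lines.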

It might make a good checkpatch addition, if it's not already there.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
                     ` (2 preceding siblings ...)
  2022-07-26 17:25   ` Borislav Petkov
@ 2022-07-26 20:17   ` Andy Lutomirski
  2022-08-09 11:38     ` Kirill A. Shutemov
  3 siblings, 1 reply; 200+ messages in thread
From: Andy Lutomirski @ 2022-07-26 20:17 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Sathyanarayanan Kuppuswamy, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand,
	Marcelo Henrique Cerri, tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List



On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
>
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.

Why is unaccepted memory marked present in the direct map in the first place?

Having kernel code assume that every valid address is followed by
several bytes of memory that may be read without side effects other
than #PF also seems like a mistake, but I probably won’t win that
fight. But sticking guard pages in front of definitely-not-logically
present pages seems silly to me. Let’s just not map it.

(What if MMIO memory is mapped next to regular memory? Doing random
unaligned reads that cross into MMIO seems unwise.)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-23 11:14                                         ` Ard Biesheuvel
@ 2022-07-28 22:01                                           ` Dionna Amalie Glaze
  2022-08-09 11:14                                           ` Kirill A. Shutemov
  1 sibling, 0 replies; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-07-28 22:01 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Dave Hansen, Marc Orr, Borislav Petkov, Kirill A. Shutemov,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

>
> What I strongly object to is inventing a new bespoke way for the
> firmware to make inferences about the capabilities of the image by
> inspecting fields in the file representation of the image (which is
> not guaranteed by EFI to be identical to its in-memory representation,
> as, e.g., the PE/COFF header could be omitted by a loader without
> violating the spec)
>
> As for the intermediate thing: yes, that would be a valuable thing to
> have in OVMF (and I will gladly take EDK2 patches that implement
> this). However, I'm not sure how you decide whether or not this thing
> should be active or not, doesn't that just move the problem around?

This does just move the problem around, but it makes correct behavior
the default instead of silently ignoring most of the VM's memory and
booting regularly. I have the driver mostly written to change the
behavior to accept all memory by default unless a driver has been
installed that sets a particular boolean to disable it. Still, that's
yet another thing, as you say.

I agree with everyone that this situation just stinks. "Can't you just
boot it?" was asked before, and yes we can, but at the scale of a CSP
managing anybody's image uploads, that not-insignificant cost has to
be paid by someone. It's a hard problem to route the image to the
right kind of machine that's expected to be able to run it... it's a
big ol' mess.

One thing is for sure: these patches shouldn't be blocked by the "how
do we detect it" question. I'm glad to see so much engagement with
this problem, but I fear I might have delayed its progress towards a
merge. I know AMD has a follow-up to add SEV-SNP accept_memory support
to finish this all up.

I'll try to get the ear of all the distributions that are tracking
towards providing SEV-SNP-supported images for CSPs to get them on the
release that includes these patches. I'll also see about upstreaming
that EFI driver and EDK2 changes in case there's a slip in the kernel
release and we need this workaround.
--
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v1 0/2] Provide SEV-SNP support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2022-06-24 16:37 ` [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Peter Gonda
@ 2022-07-29 14:01 ` Tom Lendacky
  2022-07-29 14:01   ` [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
  2022-07-29 14:01   ` [PATCH v1 " Tom Lendacky
  2022-08-08 17:16 ` [PATCH v2 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
                   ` (3 subsequent siblings)
  19 siblings, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-07-29 14:01 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

This series adds SEV-SNP support for unaccepted memory to the patch series
titled:

  [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory

Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:

  - A pre-patch to switch from a kmalloc()'d page state change structure
    to a per-CPU page state change structure.

  - SNP support for unaccepted memory.

The series is based off of and tested against Kirill Shutemov's tree:
  https://github.com/intel/tdx.git guest-unaccepted-memory

---

Tom Lendacky (2):
  x86/sev: Use per-CPU PSC structure in prep for unaccepted memory
    support
  x86/sev: Add SNP-specific unaccepted memory support

 arch/x86/Kconfig                |  1 +
 arch/x86/boot/compressed/mem.c  |  3 ++
 arch/x86/boot/compressed/sev.c  | 10 ++++-
 arch/x86/boot/compressed/sev.h  | 23 ++++++++++
 arch/x86/include/asm/sev.h      |  3 ++
 arch/x86/kernel/sev.c           | 76 ++++++++++++++++++++++++---------
 arch/x86/mm/unaccepted_memory.c |  4 ++
 7 files changed, 98 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

-- 
2.36.1


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-07-29 14:01 ` [PATCH v1 0/2] Provide SEV-SNP " Tom Lendacky
@ 2022-07-29 14:01   ` Tom Lendacky
  2022-07-29 14:18     ` Dave Hansen
  2022-07-29 14:01   ` [PATCH v1 " Tom Lendacky
  1 sibling, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-07-29 14:01 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change structure to using a
per-CPU structure. This is needed to avoid a possible recursive call into
set_pages_state() if the allocation requires (more) memory to be accepted,
which would result in a hang.

Protect the use of the per-CPU structure by disabling interrupts during
memory acceptance. Since the set_pages_state() path is the only path into
vmgexit_psc(), rename vmgexit_psc() to __vmgexit_psc() and remove the
calls to disable interrupts which are now performed by set_pages_state().

Even with interrupts disabled, an NMI can be raised while performing
memory acceptance. The NMI could then cause further memory acceptance to be
performed. To prevent corruption of the per-CPU structure, use the PSC
MSR protocol in this situation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/kernel/sev.c | 60 ++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..1f7f6205c4f6 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -104,6 +104,15 @@ struct sev_es_runtime_data {
 	 * is currently unsupported in SEV-ES guests.
 	 */
 	unsigned long dr7;
+
+	/*
+	 * Page State Change structure for use when accepting memory or when
+	 * changing page state. Interrupts are disabled when using the structure
+	 * but an NMI could still be raised, so use a flag to indicate when the
+	 * structure is in use and use the MSR protocol in these cases.
+	 */
+	struct snp_psc_desc psc_desc;
+	bool psc_active;
 };
 
 struct ghcb_state {
@@ -660,7 +669,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
 	}
 }
 
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
@@ -742,26 +751,17 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
 		WARN(1, "invalid memory op %d\n", op);
 }
 
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int __vmgexit_psc(struct snp_psc_desc *desc)
 {
 	int cur_entry, end_entry, ret = 0;
 	struct snp_psc_desc *data;
 	struct ghcb_state state;
 	struct es_em_ctxt ctxt;
-	unsigned long flags;
 	struct ghcb *ghcb;
 
-	/*
-	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
-	 * a per-CPU GHCB.
-	 */
-	local_irq_save(flags);
-
 	ghcb = __sev_get_ghcb(&state);
-	if (!ghcb) {
-		ret = 1;
-		goto out_unlock;
-	}
+	if (!ghcb)
+		return 1;
 
 	/* Copy the input desc into GHCB shared buffer */
 	data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -820,9 +820,6 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
 out:
 	__sev_put_ghcb(&state);
 
-out_unlock:
-	local_irq_restore(flags);
-
 	return ret;
 }
 
@@ -861,18 +858,32 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 		i++;
 	}
 
-	if (vmgexit_psc(data))
+	if (__vmgexit_psc(data))
 		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
 }
 
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
 	unsigned long vaddr_end, next_vaddr;
+	struct sev_es_runtime_data *data;
 	struct snp_psc_desc *desc;
+	unsigned long flags;
 
-	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
-	if (!desc)
-		panic("SNP: failed to allocate memory for PSC descriptor\n");
+	/* Disable interrupts since a per-CPU PSC and per-CPU GHCB are used. */
+	local_irq_save(flags);
+
+	data = this_cpu_read(runtime_data);
+	if (!data || data->psc_active) {
+		/* No per-CPU PSC or it is active, use the MSR protocol. */
+		early_set_pages_state(__pa(vaddr), npages, op);
+		goto out;
+	}
+
+	/* Mark the PSC in use. */
+	data->psc_active = true;
+	barrier();
+
+	desc = &data->psc_desc;
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -887,7 +898,12 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 		vaddr = next_vaddr;
 	}
 
-	kfree(desc);
+	/* Mark the PSC no longer in use. */
+	barrier();
+	data->psc_active = false;
+
+out:
+	local_irq_restore(flags);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1339,6 +1355,8 @@ static void __init alloc_runtime_data(int cpu)
 		panic("Can't allocate SEV-ES runtime data");
 
 	per_cpu(runtime_data, cpu) = data;
+
+	data->psc_active = false;
 }
 
 static void __init init_ghcb(int cpu)
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v1 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-07-29 14:01 ` [PATCH v1 0/2] Provide SEV-SNP " Tom Lendacky
  2022-07-29 14:01   ` [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
@ 2022-07-29 14:01   ` Tom Lendacky
  2022-08-23  0:24     ` Dionna Amalie Glaze
  2022-08-23 23:28     ` Dionna Amalie Glaze
  1 sibling, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-07-29 14:01 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/boot/compressed/mem.c  |  3 +++
 arch/x86/boot/compressed/sev.c  | 10 +++++++++-
 arch/x86/boot/compressed/sev.h  | 23 +++++++++++++++++++++++
 arch/x86/include/asm/sev.h      |  3 +++
 arch/x86/kernel/sev.c           | 16 ++++++++++++++++
 arch/x86/mm/unaccepted_memory.c |  4 ++++
 7 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
+	select UNACCEPTED_MEMORY
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
 #include "find.h"
 #include "math.h"
 #include "tdx.h"
+#include "sev.h"
 #include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 	/* Platform-specific memory-acceptance call goes here */
 	if (is_tdx_guest())
 		tdx_accept_memory(start, end);
+	else if (sev_snp_enabled())
+		snp_accept_memory(start, end);
 	else
 		error("Cannot accept memory: unknown platform\n");
 }
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 /* Include code for early handlers */
 #include "../../kernel/sev-shared.c"
 
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
 {
 	return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
 }
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
 	__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	while (end > start) {
+		snp_set_page_private(start);
+		start += PAGE_SIZE;
+	}
+}
+
 static bool early_setup_ghcb(void)
 {
 	if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
 	return -ENOTTY;
 }
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
 #endif
 
 #endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1f7f6205c4f6..289764e3a0b5 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -926,6 +926,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 	pvalidate_pages(vaddr, npages, true);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long vaddr;
+	unsigned int npages;
+
+	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+		return;
+
+	vaddr = (unsigned long)__va(start);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+	pvalidate_pages(vaddr, npages, true);
+}
+
 static int snp_set_vmsa(void *va, bool vmsa)
 {
 	u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
 #include <asm/setup.h>
 #include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
 
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
 			tdx_accept_memory(range_start * PMD_SIZE,
 					  range_end * PMD_SIZE);
+		} else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+			snp_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
 		} else {
 			panic("Cannot accept memory: unknown platform\n");
 		}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-07-29 14:01   ` [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
@ 2022-07-29 14:18     ` Dave Hansen
  2022-07-29 14:25       ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-07-29 14:18 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 7/29/22 07:01, Tom Lendacky wrote:
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..1f7f6205c4f6 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -104,6 +104,15 @@ struct sev_es_runtime_data {
>  	 * is currently unsupported in SEV-ES guests.
>  	 */
>  	unsigned long dr7;
> +
> +	/*
> +	 * Page State Change structure for use when accepting memory or when
> +	 * changing page state. Interrupts are disabled when using the structure
> +	 * but an NMI could still be raised, so use a flag to indicate when the
> +	 * structure is in use and use the MSR protocol in these cases.
> +	 */
> +	struct snp_psc_desc psc_desc;
> +	bool psc_active;
>  };

This thing:

struct snp_psc_desc {
        struct psc_hdr hdr;
        struct psc_entry entries[VMGEXIT_PSC_MAX_ENTRY];
} __packed;

is 16k, right?  Being per-cpu, this might eat up a MB or two of memory
on a big server?

Considering that runtime acceptance is already single-threaded[1] *and*
there's a fallback method, why not just have a single copy of this
guarded by a single lock?

1.
https://lore.kernel.org/all/20220614120231.48165-10-kirill.shutemov@linux.intel.com/

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-07-29 14:18     ` Dave Hansen
@ 2022-07-29 14:25       ` Tom Lendacky
  2022-07-29 19:08         ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-07-29 14:25 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra



On 7/29/22 09:18, Dave Hansen wrote:
> On 7/29/22 07:01, Tom Lendacky wrote:
>> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
>> index c05f0124c410..1f7f6205c4f6 100644
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -104,6 +104,15 @@ struct sev_es_runtime_data {
>>   	 * is currently unsupported in SEV-ES guests.
>>   	 */
>>   	unsigned long dr7;
>> +
>> +	/*
>> +	 * Page State Change structure for use when accepting memory or when
>> +	 * changing page state. Interrupts are disabled when using the structure
>> +	 * but an NMI could still be raised, so use a flag to indicate when the
>> +	 * structure is in use and use the MSR protocol in these cases.
>> +	 */
>> +	struct snp_psc_desc psc_desc;
>> +	bool psc_active;
>>   };
> 
> This thing:
> 
> struct snp_psc_desc {
>          struct psc_hdr hdr;
>          struct psc_entry entries[VMGEXIT_PSC_MAX_ENTRY];
> } __packed;
> 
> is 16k, right?  Being per-cpu, this might eat up a MB or two of memory
> on a big server?

It's just under 2K, 2,032 bytes.
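
(That matches an 8-byte psc_hdr plus VMGEXIT_PSC_MAX_ENTRY = 253 entries
of 8 bytes each, assuming the current layout: 8 + 253 * 8 = 2032.)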

> 
> Considering that runtime acceptance is already single-threaded[1] *and*
> there's a fallback method, why not just have a single copy of this
> guarded by a single lock?

This function is called for more than just memory acceptance. It's also 
called for any conversion between private and shared, which isn't
single-threaded.

Thanks,
Tom

> 
> 1.
> https://lore.kernel.org/all/20220614120231.48165-10-kirill.shutemov@linux.intel.com/

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-07-29 14:25       ` Tom Lendacky
@ 2022-07-29 19:08         ` Dave Hansen
  2022-07-29 19:22           ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-07-29 19:08 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 7/29/22 07:25, Tom Lendacky wrote:
>> Considering that runtime acceptance is already single-threaded[1] *and*
>> there's a fallback method, why not just have a single copy of this
>> guarded by a single lock?
> 
> This function is called for more than just memory acceptance. It's also
> called for any changes from or to private or shared, which isn't
> single-threaded.

I think this tidbit from the changelog threw me off:

> Protect the use of the per-CPU structure by disabling interrupts during
> memory acceptance.

Could you please revise that to accurately capture the impact of this
change?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-07-29 19:08         ` Dave Hansen
@ 2022-07-29 19:22           ` Tom Lendacky
  2022-07-29 19:28             ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-07-29 19:22 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 7/29/22 14:08, Dave Hansen wrote:
> On 7/29/22 07:25, Tom Lendacky wrote:
>>> Considering that runtime acceptance is already single-threaded[1] *and*
>>> there's a fallback method, why not just have a single copy of this
>>> guarded by a single lock?
>>
>> This function is called for more than just memory acceptance. It's also
>> called for any changes from or to private or shared, which isn't
>> single-threaded.
> 
> I think this tidbit from the changelog threw me off:
> 
>> Protect the use of the per-CPU structure by disabling interrupts during
>> memory acceptance.
> 
> Could you please revise that to accurately capture the impact of this
> change?

Is s/memory acceptance/page state changes/ enough of what you are looking 
for or something more?

Thanks,
Tom

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-07-29 19:22           ` Tom Lendacky
@ 2022-07-29 19:28             ` Dave Hansen
  2022-07-29 20:12               ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-07-29 19:28 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 7/29/22 12:22, Tom Lendacky wrote:
>> I think this tidbit from the changelog threw me off:
>>
>>> Protect the use of the per-CPU structure by disabling interrupts during
>>> memory acceptance.
>>
>> Could you please revise that to accurately capture the impact of this
>> change?
> 
> Is s/memory acceptance/page state changes/ enough of what you are
> looking for or something more?

That, plus a reminder of when "page state changes" are performed would
be nice.  How frequent are they?  Are they performance sensitive?
That'll help us decide if the design here is appropriate or not.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-07-29 19:28             ` Dave Hansen
@ 2022-07-29 20:12               ` Tom Lendacky
  2022-08-03 18:11                 ` [PATCH v1.1 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-07-29 20:12 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 7/29/22 14:28, Dave Hansen wrote:
> On 7/29/22 12:22, Tom Lendacky wrote:
>>> I think this tidbit from the changelog threw me off:
>>>
>>>> Protect the use of the per-CPU structure by disabling interrupts during
>>>> memory acceptance.
>>>
>>> Could you please revise that to accurately capture the impact of this
>>> change?
>>
>> Is s/memory acceptance/page state changes/ enough of what you are
>> looking for or something more?
> 
> That, plus a reminder of when "page state changes" are performed would
> be nice.  How frequent are they?  Are they performance sensitive?
> That'll help us decide if the design here is appropriate or not.

Without submitting a v2, here's what the updated paragraph would look like:

  Page state changes occur whenever DMA memory is allocated or memory needs
  to be shared with the hypervisor (kvmclock, attestation reports, etc.).
  A per-CPU structure is chosen over a single PSC structure protected with
  a lock because these changes can be initiated from interrupt or
  soft-interrupt context (e.g. the NVMe driver). Protect the use of the
  per-CPU structure by disabling interrupts during page state changes.
  Since the set_pages_state() path is the only path into vmgexit_psc(),
  rename vmgexit_psc() to __vmgexit_psc() and remove the calls to disable
  interrupts which are now performed by set_pages_state().

Hopefully there aren't a lot of page state changes occurring once a system 
has booted, so maybe a static struct with a lock would work. I am a bit 
worried about an NMI occurring during a page state change that requires a 
lock. I suppose in_nmi() can be used to detect that and go the MSR
protocol route to avoid a deadlock.
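
Roughly (a sketch only, reusing the existing MSR-protocol fallback):

	/* at the top of set_pages_state() */
	if (in_nmi()) {
		/* The PSC lock can't be taken here; use the MSR protocol */
		early_set_pages_state(__pa(vaddr & PAGE_MASK), npages, op);
		return;
	}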

I can investigate that if the 2K-extra per-CPU is not desired.

Thanks,
Tom

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-07-26 10:21   ` Borislav Petkov
@ 2022-08-02 23:46     ` Dave Hansen
  2022-08-03 14:02       ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-02 23:46 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel

On 7/26/22 03:21, Borislav Petkov wrote:
> On Tue, Jun 14, 2022 at 03:02:27PM +0300, Kirill A. Shutemov wrote:
>> But, this approach does not work for unaccepted memory. For TDX, a load
>> from unaccepted memory will not lead to a recoverable exception within
>> the guest. The guest will exit to the VMM where the only recourse is to
>> terminate the guest.
> FTR, this random-memory-access-to-unaccepted-memory-is-deadly thing is
> really silly. We should be able to handle such cases - because they do
> happen often - in a more resilient way. Just look at the complex dance
> this patch needs to do just to avoid this.
> 
> IOW, this part of the coco technology needs improvement.

This particular wound is self-inflicted.  The hardware can *today*
generate a #VE for these accesses.  But, to make writing the #VE code
more straightforward, we asked that the hardware not even bother
delivering the exception.  At the time, nobody could come up with a case
why there would ever be a legitimate, non-buggy access to unaccepted memory.

We learned about load_unaligned_zeropad() the hard way.  I never ran
into it and never knew it was there.  Dangit.

We _could_ go back to the way it was originally.  We could add
load_unaligned_zeropad() support to the #VE handler, and there's little
risk of load_unaligned_zeropad() itself being used in the
interrupts-disabled window early in the #VE handler.  That would get rid
of all the nasty adjacent page handling in the unaccepted memory code.

But, that would mean that we can land in the #VE handler from more
contexts.  Any normal, non-buggy use of load_unaligned_zeropad() can end
up there, obviously.  We would, for instance, need to be more careful
about #VE recursion.  We'd also have to make sure that _bugs_ that land
in the #VE handler can still be handled in a sane way.

To sum it all up, I'm not happy with the complexity of the page
acceptance code either, but I'm not sure that it's a bad tradeoff compared
to greater #VE complexity or fragility.

Does anyone think we should go back and really reconsider this?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-02 23:46     ` Dave Hansen
@ 2022-08-03 14:02       ` Dave Hansen
  2022-08-11 11:26         ` Borislav Petkov
  2022-08-13 16:04         ` Andy Lutomirski
  0 siblings, 2 replies; 200+ messages in thread
From: Dave Hansen @ 2022-08-03 14:02 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel

On 8/2/22 16:46, Dave Hansen wrote:
> To sum it all up, I'm not happy with the complexity of the page
> acceptance code either but I'm not sure that it's bad tradeoff compared
> to greater #VE complexity or fragility.
> 
> Does anyone think we should go back and really reconsider this?

One other thing I remembered as I re-read my write up on this.

In the "new" mode, guests never get #VE's for unaccepted memory.  They
just exit to the host and can never be reentered.  They must be killed.

In the "old" mode, I _believe_ that the guest always gets a #VE for
non-EPT-present memory.  The #VE is basically the same no matter if the
page is unaccepted or if the host goes out and makes a
previously-accepted page non-present.

One really nasty implication of this "old" mode is that the host can
remove *accepted* pages that are used in the syscall gap.  That means
that the #VE handler would need to be of the paranoid variety which
opens up all kinds of other fun.

 * "Old" - #VE's can happen in the syscall gap
 * "New" - #VE's happen at better-defined times.  Unexpected ones are
   fatal.

There's a third option which I proposed but doesn't yet exist.  The TDX
module _could_ separate the behavior of unaccepted memory #VE's and
host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
impunity and handle it in the #VE handler.  At the same time, the host
would not be allowed to remove accepted memory and cause problems in the
syscall gap.  Kinda the best of both worlds.

But, I'm not sure how valuable that would be now that we have the
(admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v1.1 0/2] Provide SEV-SNP support for unaccepted memory
  2022-07-29 20:12               ` Tom Lendacky
@ 2022-08-03 18:11                 ` Tom Lendacky
  2022-08-03 18:11                   ` [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
  2022-08-03 18:11                   ` [PATCH v1.1 2/2] x86/sev: Add SNP-specific " Tom Lendacky
  0 siblings, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 18:11 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

This series adds SEV-SNP support for unaccepted memory to the patch series
titled:

  [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory

Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:

  - A pre-patch to switch from a kmalloc()'d page state change structure
    to a static page state change structure with access protected by a
    spinlock.

  - SNP support for unaccepted memory.

The series is based off of and tested against Kirill Shutemov's tree:
  https://github.com/intel/tdx.git guest-unaccepted-memory

---

This is what the static structure / spinlock method looks like. Let me
know if this approach is preferred over the per-CPU structure. If so,
I'll submit this as a v2.

Thanks,
Tom

Tom Lendacky (2):
  x86/sev: Use per-CPU PSC structure in prep for unaccepted memory
    support
  x86/sev: Add SNP-specific unaccepted memory support

 arch/x86/Kconfig                |  1 +
 arch/x86/boot/compressed/mem.c  |  3 ++
 arch/x86/boot/compressed/sev.c  | 10 ++++-
 arch/x86/boot/compressed/sev.h  | 23 +++++++++++
 arch/x86/include/asm/sev.h      |  3 ++
 arch/x86/kernel/sev.c           | 71 ++++++++++++++++++++++-----------
 arch/x86/mm/unaccepted_memory.c |  4 ++
 7 files changed, 91 insertions(+), 24 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

-- 
2.36.1


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 18:11                 ` [PATCH v1.1 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-08-03 18:11                   ` Tom Lendacky
  2022-08-03 18:17                     ` Dave Hansen
  2022-08-03 18:18                     ` Tom Lendacky
  2022-08-03 18:11                   ` [PATCH v1.1 2/2] x86/sev: Add SNP-specific " Tom Lendacky
  1 sibling, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 18:11 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
static structure. This is needed to avoid a possible recursive call into
set_pages_state() if the kmalloc() call requires (more) memory to be
accepted, which would result in a hang.

Page state changes occur whenever DMA memory is allocated or memory needs
to be shared with the hypervisor (kvmclock, attestation reports, etc.).
Since most page state changes occur early in boot and are limited in
number, a single static PSC structure is used and protected by a spin
lock with interrupts disabled.

Even with interrupts disabled, an NMI can be raised while performing
memory acceptance. The NMI could then cause further memory acceptance to
be performed. To prevent a deadlock, use the MSR protocol if executing in
an NMI context.

Since the set_pages_state() path is the only path into vmgexit_psc(),
rename vmgexit_psc() to __vmgexit_psc() and remove the calls to disable
interrupts which are now performed by set_pages_state().

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/kernel/sev.c | 55 +++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..84d94fd2ec53 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  */
 static struct ghcb *boot_ghcb __section(".data");
 
+/* Flag to indicate when the first per-CPU GHCB is registered */
+static bool ghcb_percpu_ready __section(".data");
+
 /* Bitmap of SEV features supported by the hypervisor */
 static u64 sev_hv_features __ro_after_init;
 
@@ -122,6 +125,15 @@ struct sev_config {
 
 static struct sev_config sev_cfg __read_mostly;
 
+/*
+ * Page State Change structure for use when accepting memory or when changing
+ * page state. Use is protected by a spinlock with interrupts disabled, but an
+ * NMI could still be raised, so check if running in an NMI and use the MSR
+ * protocol in these cases.
+ */
+static struct snp_psc_desc psc_desc;
+static DEFINE_SPINLOCK(psc_desc_lock);
+
 static __always_inline bool on_vc_stack(struct pt_regs *regs)
 {
 	unsigned long sp = regs->sp;
@@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
 	}
 }
 
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
@@ -742,26 +754,17 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
 		WARN(1, "invalid memory op %d\n", op);
 }
 
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int __vmgexit_psc(struct snp_psc_desc *desc)
 {
 	int cur_entry, end_entry, ret = 0;
 	struct snp_psc_desc *data;
 	struct ghcb_state state;
 	struct es_em_ctxt ctxt;
-	unsigned long flags;
 	struct ghcb *ghcb;
 
-	/*
-	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
-	 * a per-CPU GHCB.
-	 */
-	local_irq_save(flags);
-
 	ghcb = __sev_get_ghcb(&state);
-	if (!ghcb) {
-		ret = 1;
-		goto out_unlock;
-	}
+	if (!ghcb)
+		return 1;
 
 	/* Copy the input desc into GHCB shared buffer */
 	data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -820,9 +823,6 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
 out:
 	__sev_put_ghcb(&state);
 
-out_unlock:
-	local_irq_restore(flags);
-
 	return ret;
 }
 
@@ -861,18 +861,25 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 		i++;
 	}
 
-	if (vmgexit_psc(data))
+	if (__vmgexit_psc(data))
 		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
 }
 
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
 	unsigned long vaddr_end, next_vaddr;
-	struct snp_psc_desc *desc;
+	unsigned long flags;
 
-	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
-	if (!desc)
-		panic("SNP: failed to allocate memory for PSC descriptor\n");
+	/*
+	 * Use the MSR protocol when either:
+	 *   - executing in an NMI to avoid any possibility of a deadlock
+	 *   - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
+	 *     uses the per-CPU GHCB.
+	 */
+	if (in_nmi() || !ghcb_percpu_ready)
+		return early_set_pages_state(__pa(vaddr), npages, op);
+
+	spin_lock_irqsave(&psc_desc_lock, flags);
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +889,12 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 		next_vaddr = min_t(unsigned long, vaddr_end,
 				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
 
-		__set_pages_state(desc, vaddr, next_vaddr, op);
+		__set_pages_state(&psc_desc, vaddr, next_vaddr, op);
 
 		vaddr = next_vaddr;
 	}
 
-	kfree(desc);
+	spin_unlock_irqrestore(&psc_desc_lock, flags);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1254,6 +1261,8 @@ void setup_ghcb(void)
 		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 			snp_register_per_cpu_ghcb();
 
+		ghcb_percpu_ready = true;
+
 		return;
 	}
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v1.1 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-03 18:11                 ` [PATCH v1.1 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-08-03 18:11                   ` [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
@ 2022-08-03 18:11                   ` Tom Lendacky
  1 sibling, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 18:11 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/boot/compressed/mem.c  |  3 +++
 arch/x86/boot/compressed/sev.c  | 10 +++++++++-
 arch/x86/boot/compressed/sev.h  | 23 +++++++++++++++++++++++
 arch/x86/include/asm/sev.h      |  3 +++
 arch/x86/kernel/sev.c           | 16 ++++++++++++++++
 arch/x86/mm/unaccepted_memory.c |  4 ++++
 7 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
+	select UNACCEPTED_MEMORY
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
 #include "find.h"
 #include "math.h"
 #include "tdx.h"
+#include "sev.h"
 #include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 	/* Platform-specific memory-acceptance call goes here */
 	if (is_tdx_guest())
 		tdx_accept_memory(start, end);
+	else if (sev_snp_enabled())
+		snp_accept_memory(start, end);
 	else
 		error("Cannot accept memory: unknown platform\n");
 }
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 /* Include code for early handlers */
 #include "../../kernel/sev-shared.c"
 
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
 {
 	return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
 }
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
 	__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	while (end > start) {
+		snp_set_page_private(start);
+		start += PAGE_SIZE;
+	}
+}
+
 static bool early_setup_ghcb(void)
 {
 	if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
 	return -ENOTTY;
 }
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
 #endif
 
 #endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 84d94fd2ec53..db74c38babf7 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -917,6 +917,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 	pvalidate_pages(vaddr, npages, true);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long vaddr;
+	unsigned int npages;
+
+	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+		return;
+
+	vaddr = (unsigned long)__va(start);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+	pvalidate_pages(vaddr, npages, true);
+}
+
 static int snp_set_vmsa(void *va, bool vmsa)
 {
 	u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
 #include <asm/setup.h>
 #include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
 
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
 			tdx_accept_memory(range_start * PMD_SIZE,
 					  range_end * PMD_SIZE);
+		} else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+			snp_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
 		} else {
 			panic("Cannot accept memory: unknown platform\n");
 		}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 18:11                   ` [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
@ 2022-08-03 18:17                     ` Dave Hansen
  2022-08-03 18:21                       ` Tom Lendacky
  2022-08-03 18:18                     ` Tom Lendacky
  1 sibling, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-03 18:17 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/3/22 11:11, Tom Lendacky wrote:
> +	/*
> +	 * Use the MSR protocol when either:
> +	 *   - executing in an NMI to avoid any possibility of a deadlock
> +	 *   - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
> +	 *     uses the per-CPU GHCB.
> +	 */
> +	if (in_nmi() || !ghcb_percpu_ready)
> +		return early_set_pages_state(__pa(vaddr), npages, op);
> +
> +	spin_lock_irqsave(&psc_desc_lock, flags);

Would it be simpler to just do a spin_trylock_irqsave()?  You fall back
to early_set_pages_state() whenever you can't acquire the lock.

That avoids even having to know what the situations are where you
_might_ recurse.  If it recurses, the trylock will just naturally fail.
 You simply can't have bugs where the "(in_nmi() || !ghcb_percpu_ready)"
conditional was wrong.
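
For illustration, a minimal sketch of the trylock variant being suggested
here (hypothetical, not a posted patch; it reuses the psc_desc_lock,
ghcb_percpu_ready and early_set_pages_state() names from patch 1/2):

	static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
	{
		unsigned long flags;

		/*
		 * A recursive entry (acceptance triggered while already
		 * accepting) or an NMI on the lock-holding CPU fails the
		 * trylock and drops to the MSR protocol automatically.
		 * Contention from another CPU also falls back, which is
		 * safe, just slower.
		 */
		if (!ghcb_percpu_ready ||
		    !spin_trylock_irqsave(&psc_desc_lock, flags))
			return early_set_pages_state(__pa(vaddr), npages, op);

		/* ... batch the PSC VMGEXITs through the static psc_desc ... */

		spin_unlock_irqrestore(&psc_desc_lock, flags);
	}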

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 18:11                   ` [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
  2022-08-03 18:17                     ` Dave Hansen
@ 2022-08-03 18:18                     ` Tom Lendacky
  1 sibling, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 18:18 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/3/22 13:11, Tom Lendacky wrote:

Of course, I'll fix the subject if submitting this for real... ugh.

Thanks,
Tom

> In advance of providing support for unaccepted memory, switch from using
> kmalloc() for allocating the Page State Change (PSC) structure to using a
> static structure. This is needed to avoid a possible recursive call into
> set_pages_state() if the kmalloc() call requires (more) memory to be
> accepted, which would result in a hang.
> 
> Page state changes occur whenever DMA memory is allocated or memory needs
> to be shared with the hypervisor (kvmclock, attestation reports, etc.).
> Since most page state changes occur early in boot and are limited in
> number, a single static PSC structure is used and protected by a spin
> lock with interrupts disabled.
> 
> Even with interrupts disabled, an NMI can be raised while performing
> memory acceptance. The NMI could then cause further memory acceptance to
> be performed. To prevent a deadlock, use the MSR protocol if executing in
> an NMI context.
> 
> Since the set_pages_state() path is the only path into vmgexit_psc(),
> rename vmgexit_psc() to __vmgexit_psc() and remove the calls to disable
> interrupts which are now performed by set_pages_state().
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
>   arch/x86/kernel/sev.c | 55 +++++++++++++++++++++++++------------------
>   1 file changed, 32 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..84d94fd2ec53 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>    */
>   static struct ghcb *boot_ghcb __section(".data");
>   
> +/* Flag to indicate when the first per-CPU GHCB is registered */
> +static bool ghcb_percpu_ready __section(".data");
> +
>   /* Bitmap of SEV features supported by the hypervisor */
>   static u64 sev_hv_features __ro_after_init;
>   
> @@ -122,6 +125,15 @@ struct sev_config {
>   
>   static struct sev_config sev_cfg __read_mostly;
>   
> +/*
> + * Page State Change structure for use when accepting memory or when changing
> + * page state. Use is protected by a spinlock with interrupts disabled, but an
> + * NMI could still be raised, so check if running in an NMI and use the MSR
> + * protocol in these cases.
> + */
> +static struct snp_psc_desc psc_desc;
> +static DEFINE_SPINLOCK(psc_desc_lock);
> +
>   static __always_inline bool on_vc_stack(struct pt_regs *regs)
>   {
>   	unsigned long sp = regs->sp;
> @@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
>   	}
>   }
>   
> -static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
> +static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
>   {
>   	unsigned long paddr_end;
>   	u64 val;
> @@ -742,26 +754,17 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
>   		WARN(1, "invalid memory op %d\n", op);
>   }
>   
> -static int vmgexit_psc(struct snp_psc_desc *desc)
> +static int __vmgexit_psc(struct snp_psc_desc *desc)
>   {
>   	int cur_entry, end_entry, ret = 0;
>   	struct snp_psc_desc *data;
>   	struct ghcb_state state;
>   	struct es_em_ctxt ctxt;
> -	unsigned long flags;
>   	struct ghcb *ghcb;
>   
> -	/*
> -	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
> -	 * a per-CPU GHCB.
> -	 */
> -	local_irq_save(flags);
> -
>   	ghcb = __sev_get_ghcb(&state);
> -	if (!ghcb) {
> -		ret = 1;
> -		goto out_unlock;
> -	}
> +	if (!ghcb)
> +		return 1;
>   
>   	/* Copy the input desc into GHCB shared buffer */
>   	data = (struct snp_psc_desc *)ghcb->shared_buffer;
> @@ -820,9 +823,6 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
>   out:
>   	__sev_put_ghcb(&state);
>   
> -out_unlock:
> -	local_irq_restore(flags);
> -
>   	return ret;
>   }
>   
> @@ -861,18 +861,25 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
>   		i++;
>   	}
>   
> -	if (vmgexit_psc(data))
> +	if (__vmgexit_psc(data))
>   		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
>   }
>   
>   static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>   {
>   	unsigned long vaddr_end, next_vaddr;
> -	struct snp_psc_desc *desc;
> +	unsigned long flags;
>   
> -	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
> -	if (!desc)
> -		panic("SNP: failed to allocate memory for PSC descriptor\n");
> +	/*
> +	 * Use the MSR protocol when either:
> +	 *   - executing in an NMI to avoid any possibility of a deadlock
> +	 *   - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
> +	 *     uses the per-CPU GHCB.
> +	 */
> +	if (in_nmi() || !ghcb_percpu_ready)
> +		return early_set_pages_state(__pa(vaddr), npages, op);
> +
> +	spin_lock_irqsave(&psc_desc_lock, flags);
>   
>   	vaddr = vaddr & PAGE_MASK;
>   	vaddr_end = vaddr + (npages << PAGE_SHIFT);
> @@ -882,12 +889,12 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>   		next_vaddr = min_t(unsigned long, vaddr_end,
>   				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
>   
> -		__set_pages_state(desc, vaddr, next_vaddr, op);
> +		__set_pages_state(&psc_desc, vaddr, next_vaddr, op);
>   
>   		vaddr = next_vaddr;
>   	}
>   
> -	kfree(desc);
> +	spin_unlock_irqrestore(&psc_desc_lock, flags);
>   }
>   
>   void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
> @@ -1254,6 +1261,8 @@ void setup_ghcb(void)
>   		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>   			snp_register_per_cpu_ghcb();
>   
> +		ghcb_percpu_ready = true;
> +
>   		return;
>   	}
>   

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 18:17                     ` Dave Hansen
@ 2022-08-03 18:21                       ` Tom Lendacky
  2022-08-03 18:24                         ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 18:21 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra



On 8/3/22 13:17, Dave Hansen wrote:
> On 8/3/22 11:11, Tom Lendacky wrote:
>> +	/*
>> +	 * Use the MSR protocol when either:
>> +	 *   - executing in an NMI to avoid any possibility of a deadlock
>> +	 *   - per-CPU GHCBs are not yet registered, since __vmgexit_psc()
>> +	 *     uses the per-CPU GHCB.
>> +	 */
>> +	if (in_nmi() || !ghcb_percpu_ready)
>> +		return early_set_pages_state(__pa(vaddr), npages, op);
>> +
>> +	spin_lock_irqsave(&psc_desc_lock, flags);
> 
> Would it be simpler to just do a spin_trylock_irqsave()?  You fall back
> to early_set_pages_state() whenever you can't acquire the lock.

I was looking at that and can definitely go that route if this approach is 
preferred.

Thanks,
Tom

> 
> That avoids even having to know what the situations are where you
> _might_ recurse.  If it recurses, the trylock will just naturally fail.
>   You simply can't have bugs where the "(in_nmi() || !ghcb_percpu_ready)"
> conditional was wrong.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 18:21                       ` Tom Lendacky
@ 2022-08-03 18:24                         ` Dave Hansen
  2022-08-03 21:03                           ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-03 18:24 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/3/22 11:21, Tom Lendacky wrote:
>> Would it be simpler to just do a spin_trylock_irqsave()?  You fall back
>> to early_set_pages_state() whenever you can't acquire the lock.
> 
> I was looking at that and can definitely go that route if this approach
> is preferred.

I prefer it for sure.

This whole iteration does look good to me versus the per-cpu version, so
I say go ahead with doing this for v2 once you wait a bit for any more
feedback.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 18:24                         ` Dave Hansen
@ 2022-08-03 21:03                           ` Tom Lendacky
  2022-08-03 21:18                             ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 21:03 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/3/22 13:24, Dave Hansen wrote:
> On 8/3/22 11:21, Tom Lendacky wrote:
>>> Would it be simpler to just do a spin_trylock_irqsave()?  You fall back
>>> to early_set_pages_state() whenever you can't acquire the lock.
>>
>> I was looking at that and can definitely go that route if this approach
>> is preferred.
> 
> I prefer it for sure.
> 
> This whole iteration does look good to me versus the per-cpu version, so
> I say go ahead with doing this for v2 once you wait a bit for any more
> feedback.

I'm still concerned about the whole spinlock and performance. What if I 
reduce the number of entries in the PSC structure to, say, 64, which 
reduces the size of the struct to 520 bytes? Any issue if that is put on 
the stack, instead? It definitely makes things less complicated and feels 
like a good compromise on the size vs the number of PSC VMGEXIT requests.
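
For reference, the 520-byte figure follows from the descriptor layout (a
sketch assuming the psc_hdr/psc_entry definitions used by the kernel's SNP
code; the reduced entry count is hypothetical at this point):

	/* 8-byte header + 64 entries * 8 bytes each = 520 bytes total */
	#define VMGEXIT_PSC_MAX_ENTRY	64

	struct snp_psc_desc {
		struct psc_hdr   hdr;                            /*   8 bytes */
		struct psc_entry entries[VMGEXIT_PSC_MAX_ENTRY]; /* 512 bytes */
	} __packed;

which is small enough for an on-stack copy that avoids both the kmalloc()
recursion and the global lock.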

Thanks,
Tom

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 21:03                           ` Tom Lendacky
@ 2022-08-03 21:18                             ` Dave Hansen
  2022-08-03 21:34                               ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-03 21:18 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/3/22 14:03, Tom Lendacky wrote:
>> This whole iteration does look good to me versus the per-cpu version, so
>> I say go ahead with doing this for v2 once you wait a bit for any more
>> feedback.
> 
> I'm still concerned about the whole spinlock and performance. What if I
> reduce the number of entries in the PSC structure to, say, 64, which
> reduces the size of the struct to 520 bytes? Any issue if that is put on
> the stack, instead? It definitely makes things less complicated and
> feels like a good compromise on the size vs the number of PSC VMGEXIT
> requests.

That would be fine too.

But, I doubt there will be any real performance issues coming out of
this.  As bad as this MSR thing is, I suspect it's not half as
disastrous as the global spinlock in Kirill's patches.

Also, private<->shared page conversions are *NOT* common from what I can
tell.  There are a few pages converted at boot, but most of the
guest<->host communications go through the swiotlb pages, which are static.

Are there other things that SEV uses this structure for that I'm missing?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 21:18                             ` Dave Hansen
@ 2022-08-03 21:34                               ` Tom Lendacky
  2022-08-03 21:48                                 ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 21:34 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/3/22 16:18, Dave Hansen wrote:
> On 8/3/22 14:03, Tom Lendacky wrote:
>>> This whole iteration does look good to me versus the per-cpu version, so
>>> I say go ahead with doing this for v2 once you wait a bit for any more
>>> feedback.
>>
>> I'm still concerned about the whole spinlock and performance. What if I
>> reduce the number of entries in the PSC structure to, say, 64, which
>> reduces the size of the struct to 520 bytes? Any issue if that is put on
>> the stack, instead? It definitely makes things less complicated and
>> feels like a good compromise on the size vs the number of PSC VMGEXIT
>> requests.
> 
> That would be fine too.

Ok.

> 
> But, I doubt there will be any real performance issues coming out of
> this.  As bad as this MSR thing is, I suspect it's not half as
> disastrous as the global spinlock in Kirill's patches.
> 
> Also, private<->shared page conversions are *NOT* common from what I can
> tell.  There are a few pages converted at boot, but most of the
> guest<->host communications go through the swiotlb pages, which are static.

Generally, that's true. But, e.g., a dma_alloc_coherent() actually doesn't 
go through SWIOTLB, but instead allocates the pages and makes them shared, 
which results in a page state change. The NVMe driver was calling that API 
a lot. In this case, though, the NVMe driver was running in IRQ context 
and set_memory_decrypted() could sleep, so an unencrypted DMA memory pool 
was created to work around the sleeping issue and reduce the page state 
changes. It's just things like that, that make me wary.
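
As a concrete illustration of that pattern (standard DMA API only; the
wrapper name is made up here):

	/*
	 * In an SEV-SNP guest a coherent buffer must be shared with the
	 * host. From atomic context the pages cannot be converted in place
	 * (set_memory_decrypted() may sleep), so a GFP_ATOMIC allocation is
	 * served from a pre-shared atomic DMA pool -- trading a fixed pool
	 * size for fewer page state changes.
	 */
	static void *alloc_shared_atomic(struct device *dev, size_t size,
					 dma_addr_t *dma_handle)
	{
		return dma_alloc_coherent(dev, size, dma_handle, GFP_ATOMIC);
	}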

Thanks,
Tom

> 
> Are there other things that SEV uses this structure for that I'm missing?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 21:34                               ` Tom Lendacky
@ 2022-08-03 21:48                                 ` Dave Hansen
  2022-08-03 22:17                                   ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-03 21:48 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra, Shutemov, Kirill, Huang, Kai

On 8/3/22 14:34, Tom Lendacky wrote:
>> Also, private<->shared page conversions are *NOT* common from what I can
>> tell.  There are a few pages converted at boot, but most of the
>> guest<->host communications go through the swiotlb pages, which are
>> static.
> 
> Generally, that's true. But, e.g., a dma_alloc_coherent() actually
> doesn't go through SWIOTLB, but instead allocates the pages and makes
> them shared, which results in a page state change. The NVMe driver was
> calling that API a lot. In this case, though, the NVMe driver was
> running in IRQ context and set_memory_decrypted() could sleep, so an
> unencrypted DMA memory pool was created to work around the sleeping
> issue and reduce the page state changes. It's just things like that,
> that make me wary.

Interesting.  Is that a real passthrough NVMe device or the hypervisor
presenting a virtual one that just happens to use the NVMe driver?

I'm pretty sure the TDX folks have been banking on having very few page
state changes.  But, part of that at least is their expectation of
relying heavily on virtio.

I wonder if their expectations are accurate, or if once TDX gets out
into the real world if their hopes will be dashed.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support
  2022-08-03 21:48                                 ` Dave Hansen
@ 2022-08-03 22:17                                   ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-03 22:17 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra, Shutemov, Kirill, Huang, Kai

On 8/3/22 16:48, Dave Hansen wrote:
> On 8/3/22 14:34, Tom Lendacky wrote:
>>> Also, private<->shared page conversions are *NOT* common from what I can
>>> tell.  There are a few pages converted at boot, but most of the
>>> guest<->host communications go through the swiotlb pages, which are
>>> static.
>>
>> Generally, that's true. But, e.g., a dma_alloc_coherent() actually
>> doesn't go through SWIOTLB, but instead allocates the pages and makes
>> them shared, which results in a page state change. The NVMe driver was
>> calling that API a lot. In this case, though, the NVMe driver was
>> running in IRQ context and set_memory_decrypted() could sleep, so an
>> unencrypted DMA memory pool was created to work around the sleeping
>> issue and reduce the page state changes. It's just things like that,
>> that make me wary.
> 
> Interesting.  Is that a real passthrough NVMe device or the hypervisor
> presenting a virtual one that just happens to use the NVMe driver?

Hmmm... not sure, possibly the latter. I just knew that whatever it was, 
the NVMe driver was being used.

Thanks,
Tom

> 
> I'm pretty sure the TDX folks have been banking on having very few page
> state changes.  But, part of that at least is their expectation of
> relying heavily on virtio.
> 
> I wonder if their expectations are accurate, or if once TDX gets out
> into the real world if their hopes will be dashed.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-06-14 12:02 ` [PATCHv7 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
                     ` (2 preceding siblings ...)
  2022-07-21 15:14   ` Borislav Petkov
@ 2022-08-05 11:49   ` Vlastimil Babka
  2022-08-05 12:09     ` David Hildenbrand
  2022-08-10 14:19     ` Mel Gorman
  3 siblings, 2 replies; 200+ messages in thread
From: Vlastimil Babka @ 2022-08-05 11:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport, Mel Gorman

On 6/14/22 14:02, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways the kernel can deal with unaccepted memory:
> 
>  1. Accept all the memory during the boot. It is easy to implement and
>     it doesn't have runtime cost once the system is booted. The downside
>     is very long boot time.
> 
>     Accept can be parallelized to multiple CPUs to keep it manageable
>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>     memory bandwidth and does not scale beyond that point.
> 
>  2. Accept a block of memory on the first use. It requires more
>     infrastructure and changes in page allocator to make it work, but
>     it provides good boot time.
> 
>     On-demand memory acceptance means latency spikes every time the
>     kernel steps onto a new memory block. The spikes will go away once
>     the workload's data set size stabilizes or all memory gets accepted.
> 
>  3. Accept all memory in background. Introduce a thread (or multiple)
>     that gets memory accepted proactively. It will minimize the time the
>     system experiences latency spikes on memory allocation while keeping
>     boot time low.
> 
>     This approach cannot function on its own. It is an extension of #2:
>     background memory acceptance requires functional scheduler, but the
>     page allocator may need to tap into unaccepted memory before that.
> 
>     The downside of the approach is that these threads also steal CPU
>     cycles and memory bandwidth from the user's workload and may hurt
>     user experience.
> 
> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3 and they can be implemented later based on user's
> demands.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>   - memblock has to accept memory on allocation;
> 
>   - page allocator has to accept memory on the first allocation of the
>     page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages on the first allocation.
> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
> used to indicate that the page requires acceptance.
> 
> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>  - accept_memory() makes a range of physical addresses accepted.
> 
>  - range_contains_unaccepted_memory() checks whether anything within the
>    range of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> Reviewed-by: David Hildenbrand <david@redhat.com>

Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
before, but it's really not great how this affects the core page allocator
paths. Wouldn't it be possible to only release pages to the page allocator
when accepted, and otherwise use some new per-zone variables together with
the bitmap to track how much there is to accept and where? Then it could be
hooked into get_page_from_freelist() similarly to
CONFIG_DEFERRED_STRUCT_PAGE_INIT - if we fail zone_watermark_fast() and
there are unaccepted pages in the zone, accept them and continue. With a
static key to flip in case we eventually accept everything. Because this is
really a similar scenario to the deferred init and that one was solved in a
way that adds minimal overhead.
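
For illustration, roughly (a hypothetical sketch; the static key name and
the try_to_accept_memory() helper are invented here, not part of the posted
series):

	/* In get_page_from_freelist(), next to the deferred-init handling: */
	if (!zone_watermark_fast(zone, order, mark, ac->highest_zoneidx,
				 alloc_flags, gfp_mask)) {
		if (static_branch_unlikely(&zones_with_unaccepted_pages) &&
		    try_to_accept_memory(zone))
			goto try_this_zone;	/* watermark is re-checked */

		/* ... existing fallback (node reclaim etc.) ... */
	}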

> ---
>  include/linux/page-flags.h | 31 +++++++++++++
>  mm/internal.h              | 12 +++++
>  mm/memblock.c              |  9 ++++
>  mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
>  4 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e66f7aa3191d..444ba8f4bfa0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page)
>  #define PG_offline	0x00000100
>  #define PG_table	0x00000200
>  #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
> +
> +/*
> + * Page types allowed at page_expected_state()
> + *
> + * PageUnaccepted() will get cleared in post_alloc_hook().
> + */
> +#define PAGE_TYPES_EXPECTED	PG_unaccepted
>  
>  #define PageType(page, flag)						\
>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>  	page->page_type |= PG_##lname;					\
>  }
>  
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}
> +
>  /*
>   * PageBuddy() indicates that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
> @@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>  extern void page_offline_freeze(void);
>  extern void page_offline_thaw(void);
>  extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index c0f8fbe0445b..7c1a653edfc8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>  
>  DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index e4f03a6e8e56..a1f7f8b304d5 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  		 */
>  		kmemleak_alloc_phys(found, size, 0, 0);
>  
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>  	return found;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e008a3df0485..279c2746aaa8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -122,6 +122,12 @@ typedef int __bitwise fpi_t;
>   */
>  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
>  
> +/*
> + * Check if the page needs to be marked as PageUnaccepted().
> + * Used for the new pages added to the buddy allocator for the first time.
> + */
> +#define FPI_UNACCEPTED_SLOWPATH	((__force fpi_t)BIT(3))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>  			NULL) != NULL;
>  }
>  
> +/*
> + * Page acceptance can be very slow. Do not call under critical locks.
> + */
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	int i;
> +
> +	if (!PageUnaccepted(page))
> +		return;
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +	__ClearPageUnaccepted(page);
> +
> +	/* Assert that there is no PageUnaccepted() on tail pages */
> +	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> +		for (i = 1; i < (1 << order); i++)
> +			VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i);
> +	}
> +}
> +
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long combined_pfn;
>  	struct page *buddy;
>  	bool to_tail;
> +	bool page_needs_acceptance = false;
>  
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page,
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
> +	if (PageUnaccepted(page)) {
> +		page_needs_acceptance = true;
> +		__ClearPageUnaccepted(page);
> +	}
> +
>  	while (order < MAX_ORDER - 1) {
>  		if (compaction_capture(capc, page, order, migratetype)) {
>  			__mod_zone_freepage_state(zone, -(1 << order),
> @@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page,
>  			clear_page_guard(zone, buddy, order, migratetype);
>  		else
>  			del_page_from_free_list(buddy, zone, order);
> +
> +		/* Mark page unaccepted if any of merged pages were unaccepted */
> +		if (PageUnaccepted(buddy)) {
> +			page_needs_acceptance = true;
> +			__ClearPageUnaccepted(buddy);
> +		}
> +
>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
> @@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>  	set_buddy_order(page, order);
>  
> +	/*
> + * The page gets marked as PageUnaccepted() if any of the merged-in
> + * pages is PageUnaccepted().
> +	 *
> +	 * New pages, just being added to buddy allocator, do not have
> +	 * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the
> + * page is new and the page_contains_unaccepted() check is required to
> + * determine if acceptance is required.
> +	 *
> +	 * Avoid calling page_contains_unaccepted() if it is known that the page
> +	 * needs acceptance. It can be costly.
> +	 */
> +	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
> +		page_needs_acceptance = page_contains_unaccepted(page, order);
> +	if (page_needs_acceptance)
> +		__SetPageUnaccepted(page);
> +
>  	if (fpi_flags & FPI_TO_TAIL)
>  		to_tail = true;
>  	else if (is_shuffle_order(order))
> @@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page,
>  static inline bool page_expected_state(struct page *page,
>  					unsigned long check_flags)
>  {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	/*
> +	 * The page must not be mapped to userspace and must not have
> +	 * a PageType other than listed in PAGE_TYPES_EXPECTED.
> +	 *
> +	 * Note, bit cleared means the page type is set.
> +	 */
> +	if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1))
>  		return false;
>  
>  	if (unlikely((unsigned long)page->mapping |
> @@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order)
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
>  	 */
> -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +	__free_pages_ok(page, order,
> +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON |
> +			FPI_UNACCEPTED_SLOWPATH);
>  }
>  
>  #ifdef CONFIG_NUMA
> @@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>  
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page,
>  		if (set_page_guard(zone, &page[size], high, migratetype))
>  			continue;
>  
> +		/*
> +		 * Transfer PageUnaccepted() to the newly split pages so
> +		 * they can be accepted after dropping the zone lock.
> +		 */
> +		if (PageUnaccepted(page))
> +			__SetPageUnaccepted(&page[size]);
> +
>  		add_to_free_list(&page[size], zone, high, migratetype);
>  		set_buddy_order(&page[size], high);
>  	}
> @@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 */
>  	kernel_unpoison_pages(page, 1 << order);
>  
> +	accept_page(page, order);
> +
>  	/*
>  	 * As memory initialization might be integrated into KASAN,
>  	 * KASAN unpoisoning and memory initializion code must be


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 11:49   ` Vlastimil Babka
@ 2022-08-05 12:09     ` David Hildenbrand
  2022-08-05 13:38       ` Vlastimil Babka
  2022-08-10 14:19     ` Mel Gorman
  1 sibling, 1 reply; 200+ messages in thread
From: David Hildenbrand @ 2022-08-05 12:09 UTC (permalink / raw)
  To: Vlastimil Babka, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Mike Rapoport,
	Mel Gorman

On 05.08.22 13:49, Vlastimil Babka wrote:
> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>> UEFI Specification version 2.9 introduces the concept of memory
>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
>> SEV-SNP, require memory to be accepted before it can be used by the
>> guest. Accepting happens via a protocol specific to the Virtual Machine
>> platform.
>>
>> There are several ways the kernel can deal with unaccepted memory:
>>
>>  1. Accept all the memory during the boot. It is easy to implement and
>>     it doesn't have runtime cost once the system is booted. The downside
>>     is very long boot time.
>>
>>     Accept can be parallelized to multiple CPUs to keep it manageable
>>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>>     memory bandwidth and does not scale beyond that point.
>>
>>  2. Accept a block of memory on the first use. It requires more
>>     infrastructure and changes in page allocator to make it work, but
>>     it provides good boot time.
>>
>>     On-demand memory acceptance means latency spikes every time the
>>     kernel steps onto a new memory block. The spikes will go away once
>>     the workload's data set size stabilizes or all memory gets accepted.
>>
>>  3. Accept all memory in background. Introduce a thread (or multiple)
>>     that gets memory accepted proactively. It will minimize the time the
>>     system experiences latency spikes on memory allocation while keeping
>>     boot time low.
>>
>>     This approach cannot function on its own. It is an extension of #2:
>>     background memory acceptance requires functional scheduler, but the
>>     page allocator may need to tap into unaccepted memory before that.
>>
>>     The downside of the approach is that these threads also steal CPU
>>     cycles and memory bandwidth from the user's workload and may hurt
>>     user experience.
>>
>> Implement #2 for now. It is a reasonable default. Some workloads may
>> want to use #1 or #3 and they can be implemented later based on user's
>> demands.
>>
>> Support of unaccepted memory requires a few changes in core-mm code:
>>
>>   - memblock has to accept memory on allocation;
>>
>>   - page allocator has to accept memory on the first allocation of the
>>     page;
>>
>> Memblock change is trivial.
>>
>> The page allocator is modified to accept pages on the first allocation.
>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
>> used to indicate that the page requires acceptance.
>>
>> Architecture has to provide two helpers if it wants to support
>> unaccepted memory:
>>
>>  - accept_memory() makes a range of physical addresses accepted.
>>
>>  - range_contains_unaccepted_memory() checks whether anything within the
>>    range of physical addresses requires acceptance.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
>> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
> before, but it's really not great how this affects the core page allocator
> paths. Wouldn't it be possible to only release pages to the page allocator
> when accepted, and otherwise use some new per-zone variables together with
> the bitmap to track how much there is to accept and where? Then it could be
> hooked into get_page_from_freelist() similarly to
> CONFIG_DEFERRED_STRUCT_PAGE_INIT - if we fail zone_watermark_fast() and
> there are unaccepted pages in the zone, accept them and continue. With a
> static key to flip in case we eventually accept everything. Because this is
> really a similar scenario to the deferred init and that one was solved in a
> way that adds minimal overhead.

I kind of like just having the memory stats being correct (e.g., free
memory) and acceptance being an internal detail to be triggered when
allocating pages -- just like the arch_alloc_page() callback.

I'm sure we could optimize for the !unaccepted memory via static keys
also in this version with some checks at the right places if we find
this to hurt performance?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 12:09     ` David Hildenbrand
@ 2022-08-05 13:38       ` Vlastimil Babka
  2022-08-05 14:22         ` David Hildenbrand
  2022-08-05 14:41         ` Dave Hansen
  0 siblings, 2 replies; 200+ messages in thread
From: Vlastimil Babka @ 2022-08-05 13:38 UTC (permalink / raw)
  To: David Hildenbrand, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Mike Rapoport,
	Mel Gorman

On 8/5/22 14:09, David Hildenbrand wrote:
> On 05.08.22 13:49, Vlastimil Babka wrote:
>> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory
>>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
>>> SEV-SNP, require memory to be accepted before it can be used by the
>>> guest. Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>>>
>>> There are several ways the kernel can deal with unaccepted memory:
>>>
>>>  1. Accept all the memory during the boot. It is easy to implement and
>>>     it doesn't have runtime cost once the system is booted. The downside
>>>     is very long boot time.
>>>
>>>     Accept can be parallelized to multiple CPUs to keep it manageable
>>>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>>>     memory bandwidth and does not scale beyond that point.
>>>
>>>  2. Accept a block of memory on the first use. It requires more
>>>     infrastructure and changes in page allocator to make it work, but
>>>     it provides good boot time.
>>>
>>>     On-demand memory acceptance means latency spikes every time the
>>>     kernel steps onto a new memory block. The spikes will go away once
>>>     the workload's data set size stabilizes or all memory gets accepted.
>>>
>>>  3. Accept all memory in background. Introduce a thread (or multiple)
>>>     that gets memory accepted proactively. It will minimize the time the
>>>     system experiences latency spikes on memory allocation while keeping
>>>     boot time low.
>>>
>>>     This approach cannot function on its own. It is an extension of #2:
>>>     background memory acceptance requires functional scheduler, but the
>>>     page allocator may need to tap into unaccepted memory before that.
>>>
>>>     The downside of the approach is that these threads also steal CPU
>>>     cycles and memory bandwidth from the user's workload and may hurt
>>>     user experience.
>>>
>>> Implement #2 for now. It is a reasonable default. Some workloads may
>>> want to use #1 or #3 and they can be implemented later based on user's
>>> demands.
>>>
>>> Support of unaccepted memory requires a few changes in core-mm code:
>>>
>>>   - memblock has to accept memory on allocation;
>>>
>>>   - page allocator has to accept memory on the first allocation of the
>>>     page;
>>>
>>> Memblock change is trivial.
>>>
>>> The page allocator is modified to accept pages on the first allocation.
>>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
>>> used to indicate that the page requires acceptance.
>>>
>>> Architecture has to provide two helpers if it wants to support
>>> unaccepted memory:
>>>
>>>  - accept_memory() makes a range of physical addresses accepted.
>>>
>>>  - range_contains_unaccepted_memory() checks whether anything within the
>>>    range of physical addresses requires acceptance.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> 
>> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
>> before, but it's really not great how this affects the core page allocator
>> paths. Wouldn't it be possible to only release pages to the page allocator
>> when accepted, and otherwise use some new per-zone variables together with
>> the bitmap to track how much there is to accept and where? Then it could be
>> hooked into get_page_from_freelist() similarly to
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT - if we fail zone_watermark_fast() and
>> there are unaccepted pages in the zone, accept them and continue. With a
>> static key to flip in case we eventually accept everything. Because this is
>> really a similar scenario to the deferred init and that one was solved in a
>> way that adds minimal overhead.
> 
> I kind of like just having the memory stats being correct (e.g., free
> memory) and acceptance being an internal detail to be triggered when
> allocating pages -- just like the arch_alloc_page() callback.

Hm, good point about the stats. Could be tweaked perhaps so it appears
correct on the outside, but might be tricky.

> I'm sure we could optimize for the !unaccepted memory via static keys
> also in this version with some checks at the right places if we find
> this to hurt performance?

It would be great if we could at least somehow hit the necessary code only
when dealing with a >=pageblock size block. The bitmap approach and
accepting everything smaller upfront actually seem rather compatible. Yet
in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
while merging.

A list that sits beside the existing free_area, contains only >=pageblock
order sizes of unaccepted pages (no migratetype distinguished), and we tap
into it approximately before __rmqueue_fallback()? There would be some
trickery around releasing the zone lock for doing accept_memory(), but it
should be manageable.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 13:38       ` Vlastimil Babka
@ 2022-08-05 14:22         ` David Hildenbrand
  2022-08-05 14:53           ` Dave Hansen
  2022-08-05 14:41         ` Dave Hansen
  1 sibling, 1 reply; 200+ messages in thread
From: David Hildenbrand @ 2022-08-05 14:22 UTC (permalink / raw)
  To: Vlastimil Babka, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Mike Rapoport,
	Mel Gorman

On 05.08.22 15:38, Vlastimil Babka wrote:
> On 8/5/22 14:09, David Hildenbrand wrote:
>> On 05.08.22 13:49, Vlastimil Babka wrote:
>>> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>>>> UEFI Specification version 2.9 introduces the concept of memory
>>>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
>>>> SEV-SNP, require memory to be accepted before it can be used by the
>>>> guest. Accepting happens via a protocol specific to the Virtual Machine
>>>> platform.
>>>>
>>>> There are several ways the kernel can deal with unaccepted memory:
>>>>
>>>>  1. Accept all the memory during boot. It is easy to implement and it
>>>>     doesn't have a runtime cost once the system is booted. The downside
>>>>     is a very long boot time.
>>>>
>>>>     Accept can be parallelized to multiple CPUs to keep it manageable
>>>>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>>>>     memory bandwidth and does not scale beyond that point.
>>>>
>>>>  2. Accept a block of memory on the first use. It requires more
>>>>     infrastructure and changes in page allocator to make it work, but
>>>>     it provides good boot time.
>>>>
>>>>     On-demand memory acceptance means latency spikes every time the
>>>>     kernel steps onto a new memory block. The spikes will go away once
>>>>     the workload's data set size stabilizes or all memory gets accepted.
>>>>
>>>>  3. Accept all memory in the background. Introduce a thread (or multiple)
>>>>     that gets memory accepted proactively. It will minimize the time the
>>>>     system experiences latency spikes on memory allocation while keeping
>>>>     boot time low.
>>>>
>>>>     This approach cannot function on its own. It is an extension of #2:
>>>>     background memory acceptance requires a functional scheduler, but the
>>>>     page allocator may need to tap into unaccepted memory before that.
>>>>
>>>>     The downside of the approach is that these threads also steal CPU
>>>>     cycles and memory bandwidth from the user's workload and may hurt
>>>>     user experience.
>>>>
>>>> Implement #2 for now. It is a reasonable default. Some workloads may
>>>> want to use #1 or #3; those can be implemented later based on user
>>>> demand.
>>>>
>>>> Support for unaccepted memory requires a few changes in core-mm code:
>>>>
>>>>   - memblock has to accept memory on allocation;
>>>>
>>>>   - page allocator has to accept memory on the first allocation of the
>>>>     page;
>>>>
>>>> The memblock change is trivial.
>>>>
>>>> The page allocator is modified to accept pages on the first allocation.
>>>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
>>>> used to indicate that the page requires acceptance.
>>>>
>>>> An architecture has to provide two helpers if it wants to support
>>>> unaccepted memory:
>>>>
>>>>  - accept_memory() makes a range of physical addresses accepted.
>>>>
>>>>  - range_contains_unaccepted_memory() checks whether anything within
>>>>    the range of physical addresses requires acceptance.
>>>>
>>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
>>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>>
>>> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
>>> before, but it's really not great how this affects the core page allocator
>>> paths. Wouldn't it be possible to only release pages to the page allocator when
>>> accepted, and otherwise use some new per-zone variables together with the
>>> bitmap to track how much exactly is where to accept? Then it could be hooked
>>> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
>>> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
>>> accept them and continue. With a static key to flip in case we eventually
>>> accept everything. Because this is a really similar scenario to the deferred
>>> init and that one was solved in a way that adds minimal overhead.
>>
>> I kind of like just having the memory stats being correct (e.g., free
>> memory) and acceptance being an internal detail to be triggered when
>> allocating pages -- just like the arch_alloc_page() callback.
> 
> Hm, good point about the stats. Could be tweaked perhaps so it appears
> correct on the outside, but might be tricky.
> 
>> I'm sure we could optimize for the !unaccepted memory via static keys
>> also in this version with some checks at the right places if we find
>> this to hurt performance?
> 
> It would be great if we could at least somehow hit the necessary code only
> when dealing with a >=pageblock size block. The bitmap approach and
> accepting everything smaller upfront actually seem rather compatible. Yet
> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
> while merging.
> 
> A list that sits beside the existing free_area, contains only >=pageblock
> order sizes of unaccepted pages (no migratetype distinguished), and we tap
> into it approximately before __rmqueue_fallback()? There would be some
> trickery around releasing the zone lock for doing accept_memory(), but it
> should be manageable.
> 

Just curious, do we have a microbenchmark that is able to reveal the
impact of such code changes before we start worrying?
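
A minimal sketch of what one could look like, as a throwaway kernel module
(everything here is invented for illustration, not an existing benchmark):

	#include <linux/module.h>
	#include <linux/gfp.h>
	#include <linux/ktime.h>

	/* Use an order above PAGE_ALLOC_COSTLY_ORDER so the buddy
	 * paths, not the per-cpu lists, get exercised. */
	#define BENCH_ORDER	4
	#define BENCH_LOOPS	100000

	static int __init pgbench_init(void)
	{
		ktime_t t0 = ktime_get();
		unsigned long i;

		for (i = 0; i < BENCH_LOOPS; i++) {
			struct page *page = alloc_pages(GFP_KERNEL, BENCH_ORDER);

			if (!page)
				break;
			__free_pages(page, BENCH_ORDER);
		}

		pr_info("pgbench: %lu alloc/free pairs in %lld ns\n",
			i, ktime_to_ns(ktime_sub(ktime_get(), t0)));

		return -EAGAIN;	/* nothing to keep loaded */
	}
	module_init(pgbench_init);

	MODULE_LICENSE("GPL");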

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 13:38       ` Vlastimil Babka
  2022-08-05 14:22         ` David Hildenbrand
@ 2022-08-05 14:41         ` Dave Hansen
  2022-08-05 18:17           ` Vlastimil Babka
  1 sibling, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-05 14:41 UTC (permalink / raw)
  To: Vlastimil Babka, David Hildenbrand, Kirill A. Shutemov,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport, Mel Gorman

On 8/5/22 06:38, Vlastimil Babka wrote:
>> I'm sure we could optimize for the !unaccepted memory via static keys
>> also in this version with some checks at the right places if we find
>> this to hurt performance?
> It would be great if we could at least somehow hit the necessary code only
> when dealing with a >=pageblock size block. The bitmap approach and
> accepting everything smaller upfront actually seem rather compatible. Yet
> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
> while merging.

Needing to check PageUnaccepted() during the merge is fallout from
moving the acceptance to post_alloc_hook().  I _think_ an earlier
version of this did page acceptance under the zone lock, closer to where
the page comes off the 2M/4M lists.

But, page acceptance is horribly slow, so I asked Kirill to move it out
from under the zone lock.  Doing it in post_alloc_hook() (after the zone
lock is dropped) makes a lot of sense since we do zeroing in there and
zeroing is also nice and slow.

But, post_alloc_hook() is long after the 2M page has been split and that
means that we have to deal with potentially unaccepted pages during merges.

I think there are three basic options:

1. This patch: Do acceptance after the zone lock is dropped and deal
   with mixed-acceptance merges
2. Do acceptance under the zone lock as pages come off the 2M/4M lists,
   but before the page is split.
3. Pull the page off the 2M/4M lists, drop the zone lock, accept it,
   then put it back.

I'm not sure any of those other options are better.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 14:22         ` David Hildenbrand
@ 2022-08-05 14:53           ` Dave Hansen
  0 siblings, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-08-05 14:53 UTC (permalink / raw)
  To: David Hildenbrand, Vlastimil Babka, Kirill A. Shutemov,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport, Mel Gorman

On 8/5/22 07:22, David Hildenbrand wrote:
>> A list that sits beside the existing free_area, contains only >=pageblock
>> order sizes of unaccepted pages (no migratetype distinguished), and we tap
>> into it approximately before __rmqueue_fallback()? There would be some
>> trickery around releasing the zone lock for doing accept_memory(), but it
>> should be manageable.
>>
> Just curious, do we have a microbenchmark that is able to reveal the
> impact of such code changes before we start worrying?

Nope.  I went looking to see if I could find any impact.  I think Kirill
did too.  Too bad that effort didn't make it into the changelog yet.

The merging check at least is just checking a field in a cache-hot
'struct page'.  The common case is probably three instructions:

	load to a register
	check the bit
	jump if not set

It adds a wee bit of icache pressure, but it's also the kind of thing
that should be a piece of cake for the branch predictors.

That dynamic check could easily be wrapped by a static branch.  But,
that first requires more code to go dig in the nooks and crannies of the
page allocator to make sure *ALL* pages are accepted.
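
A minimal sketch of that wrapping, assuming a key that is enabled while
unaccepted memory exists (the key and helper names are invented, not from
the series):

	#include <linux/jump_label.h>

	DEFINE_STATIC_KEY_FALSE(unaccepted_pages_key);

	static inline bool buddy_maybe_unaccepted(struct page *buddy)
	{
		/* Compiles down to a NOP once the key is disabled */
		if (!static_branch_unlikely(&unaccepted_pages_key))
			return false;

		return PageUnaccepted(buddy);
	}

static_branch_enable() would be called during boot if the EFI map contains
unaccepted memory, and static_branch_disable() once the bitmap is empty.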

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 14:41         ` Dave Hansen
@ 2022-08-05 18:17           ` Vlastimil Babka
  2022-08-08 15:55             ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Vlastimil Babka @ 2022-08-05 18:17 UTC (permalink / raw)
  To: Dave Hansen, David Hildenbrand, Kirill A. Shutemov,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport, Mel Gorman

On 8/5/22 16:41, Dave Hansen wrote:
> On 8/5/22 06:38, Vlastimil Babka wrote:
>>> I'm sure we could optimize for the !unaccepted memory via static keys
>>> also in this version with some checks at the right places if we find
>>> this to hurt performance?
>> It would be great if we could at least somehow hit the necessary code only
>> when dealing with a >=pageblock size block. The bitmap approach and
>> accepting everything smaller upfront actually seem rather compatible. Yet
>> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
>> while merging.
> 
> Needing to check PageUnaccepted() during the merge is fallout from
> moving the acceptance to post_alloc_hook().  I _think_ an earlier
> version of this did page acceptance under the zone lock, closer to where
> the page comes off the 2M/4M lists.
> 
> But, page acceptance is horribly slow, so I asked Kirill to move it out
> from under the zone lock.  Doing it in post_alloc_hook() (after the zone
> lock is dropped) makes a lot of sense since we do zeroing in there and
> zeroing is also nice and slow.
> 
> But, post_alloc_hook() is long after the 2M page has been split and that
> means that we have to deal with potentially unaccepted pages during merges.
> 
> I think there are three basic options:
> 
> 1. This patch: Do acceptance after the zone lock is dropped and deal
>    with mixed-acceptance merges
> 2. Do acceptance under the zone lock as pages come off the 2M/4M lists,
>    but before the page is split.

Rather not, as acceptance can be slow and we shouldn't hog the zone lock
while doing it.

> 3. Pull the page off the 2M/4M lists, drop the zone lock, accept it,
>    then put it back.

Worth trying, IMHO. Perhaps easier to manage if the lists are distinct from
the normal ones, as I suggested.

> I'm not sure any of those other options are better.


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 18:17           ` Vlastimil Babka
@ 2022-08-08 15:55             ` Dave Hansen
  0 siblings, 0 replies; 200+ messages in thread
From: Dave Hansen @ 2022-08-08 15:55 UTC (permalink / raw)
  To: Vlastimil Babka, David Hildenbrand, Kirill A. Shutemov,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport, Mel Gorman

On 8/5/22 11:17, Vlastimil Babka wrote:
>> 3. Pull the page off the 2M/4M lists, drop the zone lock, accept it,
>>    then put it back.
> Worth trying, IMHO. Perhaps easier to manage if the lists are distinct from
> the normal ones, as I suggested.

I was playing with another series recently where I did this, momentarily
taking pages off some of the high-order lists and dropping the zone lock.

Kirill, if you go looking at this, just make sure that you don't let
this happen to too much memory at once.  You might end up yanking memory
out of the allocator that's not reflected in NR_FREE_PAGES.

You might, for instance, want to make sure that only a small number of
threads can have pulled memory off the free lists at once.  Something
*logically* like this:

// Limit to two threads at once:
atomic_t nr_accepting_threads = ATOMIC_INIT(2);

	page = del_page_from_free_list();
	if (!PageAccepted(page)) {
		// goes negative once two threads are already accepting
		if (atomic_dec_return(&nr_accepting_threads) < 0) {
			// already at the thread limit
			atomic_inc(&nr_accepting_threads);
			add_to_free_list(page, ...);
			spin_unlock_irq(&zone->lock);
			// wait for a slot...
			spin_lock_irq(&zone->lock);
			goto retry;
		} else {
			spin_unlock_irq(&zone->lock);
			accept_page(page);
			spin_lock_irq(&zone->lock);
			add_to_free_list(page, ...);
			atomic_inc(&nr_accepting_threads); // release the slot
			// do merging if it was a 2M page
		}
	}

It's a little nasty because the whole thing is not a sleepable context.
 I also know that the merging code needs some refactoring if you want to
do merging with 2M pages here.  It might all get easier if you move all
the page allocator stuff to only work at the 4M granularity.

In any case, I'm not trying to dissuade anyone from listening to the
other reviewer feedback.  Just trying to save you a few cycles on a
similar problem I was looking at recently.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v2 0/2] Provide SEV-SNP support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2022-07-29 14:01 ` [PATCH v1 0/2] Provide SEV-SNP " Tom Lendacky
@ 2022-08-08 17:16 ` Tom Lendacky
  2022-08-08 17:16   ` [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
  2022-08-08 17:16   ` [PATCH v2 2/2] x86/sev: Add SNP-specific " Tom Lendacky
  2022-08-15 15:57 ` [PATCH v3 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
                   ` (2 subsequent siblings)
  19 siblings, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-08 17:16 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

This series adds SEV-SNP support for unaccepted memory to the patch series
titled:

  [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory

Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:

  - A pre-patch to switch from a kmalloc()'d page state change structure
    to a (smaller) stack-based page state change structure.

  - SNP support for unaccepted memory.

The series is based off of and tested against Kirill Shutemov's tree:
  https://github.com/intel/tdx.git guest-unaccepted-memory

---

Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
  structure.


Tom Lendacky (2):
  x86/sev: Put PSC struct on the stack in prep for unaccepted memory
    support
  x86/sev: Add SNP-specific unaccepted memory support

 arch/x86/Kconfig                  |  1 +
 arch/x86/boot/compressed/mem.c    |  3 +++
 arch/x86/boot/compressed/sev.c    | 10 +++++++-
 arch/x86/boot/compressed/sev.h    | 23 ++++++++++++++++++
 arch/x86/include/asm/sev-common.h |  2 +-
 arch/x86/include/asm/sev.h        |  3 +++
 arch/x86/kernel/sev.c             | 40 ++++++++++++++++++++++++-------
 arch/x86/mm/unaccepted_memory.c   |  4 ++++
 8 files changed, 76 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

-- 
2.36.1


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-08 17:16 ` [PATCH v2 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-08-08 17:16   ` Tom Lendacky
  2022-08-08 21:43     ` Dave Hansen
  2022-08-12 13:03     ` Borislav Petkov
  2022-08-08 17:16   ` [PATCH v2 2/2] x86/sev: Add SNP-specific " Tom Lendacky
  1 sibling, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-08 17:16 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.

The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests.
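
(For reference, assuming the 8-byte psc_hdr and 8-byte entries of struct
snp_psc_desc: 8 + 253 * 8 = 2,032 bytes versus 8 + 64 * 8 = 520 bytes.)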

Also, since set_pages_state() uses the per-CPU GHCB, add a static variable
that indicates when per-CPU GHCBs are available. Until they are available,
the GHCB MSR protocol is used to perform page state changes.

If either the reduction in PSC entries or the use of the MSR protocol
until the per-CPU GHCBs are available results in any kind of performance
issue (none is seen at the moment), use of a larger static PSC struct with
a fallback to the smaller stack version, or use of the boot GHCB prior to
the per-CPU GHCBs, can be investigated.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/include/asm/sev-common.h |  2 +-
 arch/x86/kernel/sev.c             | 24 ++++++++++++++++--------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..6f7268a817fc 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -107,7 +107,7 @@ enum psc_op {
 #define GHCB_HV_FT_SNP_AP_CREATION	BIT_ULL(1)
 
 /* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY		253
+#define VMGEXIT_PSC_MAX_ENTRY		64
 
 struct psc_hdr {
 	u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..275aa890611f 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  */
 static struct ghcb *boot_ghcb __section(".data");
 
+/* Flag to indicate when the first per-CPU GHCB is registered */
+static bool ghcb_percpu_ready __section(".data");
+
 /* Bitmap of SEV features supported by the hypervisor */
 static u64 sev_hv_features __ro_after_init;
 
@@ -660,7 +663,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
 	}
 }
 
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
@@ -868,11 +871,16 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
 	unsigned long vaddr_end, next_vaddr;
-	struct snp_psc_desc *desc;
+	struct snp_psc_desc desc;
+
+	/*
+	 * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
+	 * since vmgexit_psc() uses the per-CPU GHCB.
+	 */
+	if (!ghcb_percpu_ready)
+		return early_set_pages_state(__pa(vaddr), npages, op);
 
-	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
-	if (!desc)
-		panic("SNP: failed to allocate memory for PSC descriptor\n");
+	memset(&desc, 0, sizeof(desc));
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +890,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 		next_vaddr = min_t(unsigned long, vaddr_end,
 				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
 
-		__set_pages_state(desc, vaddr, next_vaddr, op);
+		__set_pages_state(&desc, vaddr, next_vaddr, op);
 
 		vaddr = next_vaddr;
 	}
-
-	kfree(desc);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1254,6 +1260,8 @@ void setup_ghcb(void)
 		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 			snp_register_per_cpu_ghcb();
 
+		ghcb_percpu_ready = true;
+
 		return;
 	}
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v2 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-08 17:16 ` [PATCH v2 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-08-08 17:16   ` [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
@ 2022-08-08 17:16   ` Tom Lendacky
  1 sibling, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-08 17:16 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/boot/compressed/mem.c  |  3 +++
 arch/x86/boot/compressed/sev.c  | 10 +++++++++-
 arch/x86/boot/compressed/sev.h  | 23 +++++++++++++++++++++++
 arch/x86/include/asm/sev.h      |  3 +++
 arch/x86/kernel/sev.c           | 16 ++++++++++++++++
 arch/x86/mm/unaccepted_memory.c |  4 ++++
 7 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
+	select UNACCEPTED_MEMORY
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
 #include "find.h"
 #include "math.h"
 #include "tdx.h"
+#include "sev.h"
 #include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 	/* Platform-specific memory-acceptance call goes here */
 	if (is_tdx_guest())
 		tdx_accept_memory(start, end);
+	else if (sev_snp_enabled())
+		snp_accept_memory(start, end);
 	else
 		error("Cannot accept memory: unknown platform\n");
 }
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 /* Include code for early handlers */
 #include "../../kernel/sev-shared.c"
 
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
 {
 	return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
 }
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
 	__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	while (end > start) {
+		snp_set_page_private(start);
+		start += PAGE_SIZE;
+	}
+}
+
 static bool early_setup_ghcb(void)
 {
 	if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
 	return -ENOTTY;
 }
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
 #endif
 
 #endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 275aa890611f..f64805fa5dcb 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -916,6 +916,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 	pvalidate_pages(vaddr, npages, true);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long vaddr;
+	unsigned int npages;
+
+	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+		return;
+
+	vaddr = (unsigned long)__va(start);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+	pvalidate_pages(vaddr, npages, true);
+}
+
 static int snp_set_vmsa(void *va, bool vmsa)
 {
 	u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
 #include <asm/setup.h>
 #include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
 
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
 			tdx_accept_memory(range_start * PMD_SIZE,
 					  range_end * PMD_SIZE);
+		} else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+			snp_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
 		} else {
 			panic("Cannot accept memory: unknown platform\n");
 		}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-08 17:16   ` [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
@ 2022-08-08 21:43     ` Dave Hansen
  2022-08-08 22:18       ` Tom Lendacky
  2022-08-12 13:03     ` Borislav Petkov
  1 sibling, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-08 21:43 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/8/22 10:16, Tom Lendacky wrote:
...
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index b8357d6ecd47..6f7268a817fc 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -107,7 +107,7 @@ enum psc_op {
>  #define GHCB_HV_FT_SNP_AP_CREATION	BIT_ULL(1)
>  
>  /* SNP Page State Change NAE event */
> -#define VMGEXIT_PSC_MAX_ENTRY		253
> +#define VMGEXIT_PSC_MAX_ENTRY		64

In general, the stack-based allocation looks fine.  It might be worth a
comment in there to make it clear that this can consume stack space.

>  struct psc_hdr {
>  	u16 cur_entry;
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..275aa890611f 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>   */
>  static struct ghcb *boot_ghcb __section(".data");
>  
> +/* Flag to indicate when the first per-CPU GHCB is registered */
> +static bool ghcb_percpu_ready __section(".data");

So, there's a code path that can't be entered until this is set?  Seems
like the least we can do is annotate that path with a
WARN_ON_ONCE(!ghcb_percpu_ready).

Also, how does having _one_ global variable work for indicating the
state of multiple per-cpu structures?  The code doesn't seem to delay
setting this variable until after _all_ of the per-cpu state is ready.

>  /* Bitmap of SEV features supported by the hypervisor */
>  static u64 sev_hv_features __ro_after_init;
>  
> @@ -660,7 +663,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
>  	}
>  }
>  
> -static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
> +static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
>  {
>  	unsigned long paddr_end;
>  	u64 val;
> @@ -868,11 +871,16 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
>  static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>  {
>  	unsigned long vaddr_end, next_vaddr;
> -	struct snp_psc_desc *desc;
> +	struct snp_psc_desc desc;
> +
> +	/*
> +	 * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
> +	 * since vmgexit_psc() uses the per-CPU GHCB.
> +	 */
> +	if (!ghcb_percpu_ready)
> +		return early_set_pages_state(__pa(vaddr), npages, op);
>  
> -	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
> -	if (!desc)
> -		panic("SNP: failed to allocate memory for PSC descriptor\n");
> +	memset(&desc, 0, sizeof(desc));

Why is this using memset()?  The compiler should be smart enough to
delay initializing 'desc' until after the return with this kind of
construct:

	struct snp_psc_desc desc = {};
	if (foo)
		return;
	// use 'desc' here

The compiler *knows* there is no access to 'desc' before the if().


>  	vaddr = vaddr & PAGE_MASK;
>  	vaddr_end = vaddr + (npages << PAGE_SHIFT);
> @@ -882,12 +890,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>  		next_vaddr = min_t(unsigned long, vaddr_end,
>  				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
>  
> -		__set_pages_state(desc, vaddr, next_vaddr, op);
> +		__set_pages_state(&desc, vaddr, next_vaddr, op);
>  
>  		vaddr = next_vaddr;
>  	}
> -
> -	kfree(desc);
>  }
>  
>  void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
>  		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>  			snp_register_per_cpu_ghcb();
>  
> +		ghcb_percpu_ready = true;
> +
>  		return;
>  	}
>  


^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-08 21:43     ` Dave Hansen
@ 2022-08-08 22:18       ` Tom Lendacky
  2022-08-08 22:33         ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-08 22:18 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/8/22 16:43, Dave Hansen wrote:
> On 8/8/22 10:16, Tom Lendacky wrote:
> ...
>> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
>> index b8357d6ecd47..6f7268a817fc 100644
>> --- a/arch/x86/include/asm/sev-common.h
>> +++ b/arch/x86/include/asm/sev-common.h
>> @@ -107,7 +107,7 @@ enum psc_op {
>>   #define GHCB_HV_FT_SNP_AP_CREATION	BIT_ULL(1)
>>   
>>   /* SNP Page State Change NAE event */
>> -#define VMGEXIT_PSC_MAX_ENTRY		253
>> +#define VMGEXIT_PSC_MAX_ENTRY		64
> 
> In general, the stack-based allocation looks fine.  It might be worth a
> comment in there to make it clear that this can consume stack space.

I'll add that.

> 
>>   struct psc_hdr {
>>   	u16 cur_entry;
>> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
>> index c05f0124c410..275aa890611f 100644
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -66,6 +66,9 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>>    */
>>   static struct ghcb *boot_ghcb __section(".data");
>>   
>> +/* Flag to indicate when the first per-CPU GHCB is registered */
>> +static bool ghcb_percpu_ready __section(".data");
> 
> So, there's a code path that can't be entered until this is set?  Seems
> like the least we can do is annotate that path with a
> WARN_ON_ONCE(!ghcb_percpu_ready).

Sure, that can be added. Right now the only function that calls 
vmgexit_psc() is covered (set_pages_state()/__set_pages_state()) and is 
doing the right thing. But I guess if anything is added in the future, 
that will provide details on what happened.

> 
> Also, how does having _one_ global variable work for indicating the
> state of multiple per-cpu structures?  The code doesn't seem to delay
> setting this variable until after _all_ of the per-cpu state is ready.

All of the per-CPU GHCBs are allocated during the BSP boot, before any AP 
is started. The APs only ever run the kernel_exc_vmm_communication() #VC 
handler and only ever use the per-CPU version of the GHCB, never the early 
boot version. This is based on the initial_vc_handler being switched to 
the runtime #VC handler, kernel_exc_vmm_communication.

The trigger for the switch over for the BSP from the early boot GHCB to 
the per-CPU GHCB is during setup_ghcb() after the initial_vc_handler has 
been switched to kernel_exc_vmm_communication, which is just after the 
per-CPU allocations. By putting the setting of the ghcb_percpu_ready in 
setup_ghcb(), it indicates that the BSP per-CPU GHCB has been registered 
and can be used.

> 
>>   /* Bitmap of SEV features supported by the hypervisor */
>>   static u64 sev_hv_features __ro_after_init;
>>   
>> @@ -660,7 +663,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
>>   	}
>>   }
>>   
>> -static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
>> +static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
>>   {
>>   	unsigned long paddr_end;
>>   	u64 val;
>> @@ -868,11 +871,16 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
>>   static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>>   {
>>   	unsigned long vaddr_end, next_vaddr;
>> -	struct snp_psc_desc *desc;
>> +	struct snp_psc_desc desc;
>> +
>> +	/*
>> +	 * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
>> +	 * since vmgexit_psc() uses the per-CPU GHCB.
>> +	 */
>> +	if (!ghcb_percpu_ready)
>> +		return early_set_pages_state(__pa(vaddr), npages, op);
>>   
>> -	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
>> -	if (!desc)
>> -		panic("SNP: failed to allocate memory for PSC descriptor\n");
>> +	memset(&desc, 0, sizeof(desc));
> 
> Why is this using memset()?  The compiler should be smart enough to
> delay initializing 'desc' until after the return with this kind of
> construct:
> 
> 	struct snp_psc_desc desc = {};
> 	if (foo)
> 		return;
> 	// use 'desc' here
> 
> The compiler *knows* there is no access to 'desc' before the if().

Yup, I can change that.

Thanks,
Tom

> 
> 
>>   	vaddr = vaddr & PAGE_MASK;
>>   	vaddr_end = vaddr + (npages << PAGE_SHIFT);
>> @@ -882,12 +890,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>>   		next_vaddr = min_t(unsigned long, vaddr_end,
>>   				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
>>   
>> -		__set_pages_state(desc, vaddr, next_vaddr, op);
>> +		__set_pages_state(&desc, vaddr, next_vaddr, op);
>>   
>>   		vaddr = next_vaddr;
>>   	}
>> -
>> -	kfree(desc);
>>   }
>>   
>>   void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
>> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
>>   		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>>   			snp_register_per_cpu_ghcb();
>>   
>> +		ghcb_percpu_ready = true;
>> +
>>   		return;
>>   	}
>>   
> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-08 22:18       ` Tom Lendacky
@ 2022-08-08 22:33         ` Dave Hansen
  2022-08-08 22:35           ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-08 22:33 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/8/22 15:18, Tom Lendacky wrote:
>>>   +/* Flag to indicate when the first per-CPU GHCB is registered */
>>> +static bool ghcb_percpu_ready __section(".data");
>>
>> So, there's a code path that can't be entered until this is set?  Seems
>> like the least we can do is annotate that path with a
>> WARN_ON_ONCE(!ghcb_percpu_ready).
> 
> Sure, that can be added. Right now the only function that calls
> vmgexit_psc() is covered (set_pages_state()/__set_pages_state()) and is
> doing the right thing. But I guess if anything is added in the future,
> that will provide details on what happened.
> 
>>
>> Also, how does having _one_ global variable work for indicating the
>> state of multiple per-cpu structures?  The code doesn't seem to delay
>> setting this variable until after _all_ of the per-cpu state is ready.
> 
> All of the per-CPU GHCBs are allocated during the BSP boot, before any
> AP is started. The APs only ever run the kernel_exc_vmm_communication()
> #VC handler and only ever use the per-CPU version of the GHCB, never the
> early boot version. This is based on the initial_vc_handler being
> switched to the runtime #VC handler, kernel_exc_vmm_communication.
> 
> The trigger for the switch over for the BSP from the early boot GHCB to
> the per-CPU GHCB is during setup_ghcb() after the initial_vc_handler has
> been switched to kernel_exc_vmm_communication, which is just after the
> per-CPU allocations. By putting the setting of the ghcb_percpu_ready in
> setup_ghcb(), it indicates that the BSP per-CPU GHCB has been registered
> and can be used.

That description makes the proposed comment even more confusing:

	/* Flag to indicate when the first per-CPU GHCB is registered */

The important thing is that this variable is only _useful_ for the boot
CPU.  After the boot CPU has allocated space for _itself_, it can then
go and stop using the MSR-based method.

The reason it's set after "the first" is that "the first" is also the
boot CPU, but referring to it as "the first" is a bit oblique.
Maybe something like this:

	/*
	 * Set after the boot CPU's GHCB is registered.  At that point,
	 * it can be used for calls instead of MSRs.
	 */

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-08 22:33         ` Dave Hansen
@ 2022-08-08 22:35           ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-08 22:35 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/8/22 17:33, Dave Hansen wrote:
> On 8/8/22 15:18, Tom Lendacky wrote:
>>>>    +/* Flag to indicate when the first per-CPU GHCB is registered */
>>>> +static bool ghcb_percpu_ready __section(".data");
>>>
>>> So, there's a code path that can't be entered until this is set?  Seems
>>> like the least we can do is annotate that path with a
>>> WARN_ON_ONCE(!ghcb_percpu_ready).
>>
>> Sure, that can be added. Right now the only function that calls
>> vmgexit_psc() is covered (set_pages_state()/__set_pages_state()) and is
>> doing the right thing. But I guess if anything is added in the future,
>> that will provide details on what happened.
>>
>>>
>>> Also, how does having _one_ global variable work for indicating the
>>> state of multiple per-cpu structures?  The code doesn't seem to delay
>>> setting this variable until after _all_ of the per-cpu state is ready.
>>
>> All of the per-CPU GHCBs are allocated during the BSP boot, before any
>> AP is started. The APs only ever run the kernel_exc_vmm_communication()
>> #VC handler and only ever use the per-CPU version of the GHCB, never the
>> early boot version. This is based on the initial_vc_handler being
>> switched to the runtime #VC handler, kernel_exc_vmm_communication.
>>
>> The trigger for the switch over for the BSP from the early boot GHCB to
>> the per-CPU GHCB is during setup_ghcb() after the initial_vc_handler has
>> been switched to kernel_exc_vmm_communication, which is just after the
>> per-CPU allocations. By putting the setting of the ghcb_percpu_ready in
>> setup_ghcb(), it indicates that the BSP per-CPU GHCB has been registered
>> and can be used.
> 
> That description makes the proposed comment even more confusing:
> 
> 	/* Flag to indicate when the first per-CPU GHCB is registered */
> 
> The important thing is that this variable is only _useful_ for the boot
> CPU.  After the boot CPU has allocated space for _itself_, it can then
> go and stop using the MSR-based method.
> 
> The reason it's set after "the first" is because "the first" is also the
> boot CPU, but referring to it as the "the first" is a bit oblique.
> Maybe something like this:
> 
> 	/*
> 	 * Set after the boot CPU's GHCB is registered.  At that point,
> 	 * it can be used for calls instead of MSRs.
> 	 */

Sure, I'll work on the wording.

Thanks,
Tom

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-07-23 11:14                                         ` Ard Biesheuvel
  2022-07-28 22:01                                           ` Dionna Amalie Glaze
@ 2022-08-09 11:14                                           ` Kirill A. Shutemov
  2022-08-09 11:36                                             ` Ard Biesheuvel
  1 sibling, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-08-09 11:14 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Dave Hansen, Marc Orr, Borislav Petkov, Dionna Amalie Glaze,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On Sat, Jul 23, 2022 at 01:14:07PM +0200, Ard Biesheuvel wrote:
> On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 7/19/22 17:26, Marc Orr wrote:
> > > - Dave's suggestion to "2. Boot some intermediate thing like a
> > > bootloader that does acceptance ..." is pretty clever! So if upstream
> > > thinks this FW-kernel negotiation is not a good direction, maybe we
> > > (Google) can pursue this idea to avoid introducing yet another tag on
> > > our images.
> >
> > I'm obviously speaking only for myself here and not for "upstream" as a
> > whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> > permanent pain in our necks to solve a very temporary problem.
> 
> EFI is basically our existing embodiment of this fw/kernel negotiation
> thing, and iff we need it, I have no objection to using it for this
> purpose, i.e., to allow the firmware to infer whether or not it should
> accept all available memory on behalf of the OS before exiting boot
> services. But if we don't need this, even better.

FW/kernel negotiation does not work if there's a boot loader in the middle
that does ExitBootServices(). By the time the kernel can announce whether it
supports unaccepted memory, there's nobody to announce to.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-08-09 11:14                                           ` Kirill A. Shutemov
@ 2022-08-09 11:36                                             ` Ard Biesheuvel
  2022-08-09 11:54                                               ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Ard Biesheuvel @ 2022-08-09 11:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Marc Orr, Borislav Petkov, Dionna Amalie Glaze,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On Tue, 9 Aug 2022 at 13:11, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Sat, Jul 23, 2022 at 01:14:07PM +0200, Ard Biesheuvel wrote:
> > On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 7/19/22 17:26, Marc Orr wrote:
> > > > - Dave's suggestion to "2. Boot some intermediate thing like a
> > > > bootloader that does acceptance ..." is pretty clever! So if upstream
> > > > thinks this FW-kernel negotiation is not a good direction, maybe we
> > > > (Google) can pursue this idea to avoid introducing yet another tag on
> > > > our images.
> > >
> > > I'm obviously speaking only for myself here and not for "upstream" as a
> > > whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> > > permanent pain in our necks to solve a very temporary problem.
> >
> > EFI is basically our existing embodiment of this fw/kernel negotiation
> > thing, and iff we need it, I have no objection to using it for this
> > purpose, i.e., to allow the firmware to infer whether or not it should
> > accept all available memory on behalf of the OS before exiting boot
> > services. But if we don't need this, even better.
>
> FW/kernel negotiation does not work if there's a boot loader in the middle
> that does ExitBootServices(). By the time the kernel can announce whether it
> supports unaccepted memory, there's nobody to announce to.
>

Why would you want to support such bootloaders for TDX anyway? TDX
heavily relies on measured boot abstractions and other things that are
closely tied to firmware.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-07-26 20:17   ` Andy Lutomirski
@ 2022-08-09 11:38     ` Kirill A. Shutemov
  2022-08-13 16:03       ` Andy Lutomirski
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-08-09 11:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Sathyanarayanan Kuppuswamy, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand,
	Marcelo Henrique Cerri, tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List

On Tue, Jul 26, 2022 at 01:17:13PM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> > The unwanted loads are typically harmless. But, they might be made to
> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> > unwanted loads.
> >
> > But, this approach does not work for unaccepted memory. For TDX, a load
> > from unaccepted memory will not lead to a recoverable exception within
> > the guest. The guest will exit to the VMM where the only recourse is to
> > terminate the guest.
> 
> Why is unaccepted memory marked present in the direct map in the first place?
> 
> Having kernel code assume that every valid address is followed by
> several bytes of memory that may be read without side effects other than
> #PF also seems like a mistake, but I probably won’t win that fight. But
> sticking guard pages in front of definitely-not-logically present pages
> seems silly to me.  Let’s just not map it.

It would mean no 1G pages in the direct mapping for TDX, as we accept 2M at
a time.

> (What if MMIO memory is mapped next to regular memory?  Doing random
> unaligned reads that cross into MMIO seems unwise.)

MMIO is shared, not unaccepted private. We already handle that situation.
See 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad() page-cross to
a shared page").

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 14/14] x86/tdx: Add unaccepted memory support
  2022-07-26 14:51   ` Borislav Petkov
@ 2022-08-09 11:45     ` Kirill A. Shutemov
  2022-08-10 10:27       ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-08-09 11:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Jul 26, 2022 at 04:51:16PM +0200, Borislav Petkov wrote:
> On Tue, Jun 14, 2022 at 03:02:31PM +0300, Kirill A. Shutemov wrote:
> > +static bool is_tdx_guest(void)
> > +{
> > +	static bool once;
> > +	static bool is_tdx;
> > +
> > +	if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
> > +		return false;
> > +
> > +	if (!once) {
> > +		u32 eax, sig[3];
> > +
> > +		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
> > +			    &sig[0], &sig[2],  &sig[1]);
> > +		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
> > +		once = true;
> > +	}
> > +
> > +	return is_tdx;
> > +}
> 
> early_tdx_detect() already calls this CPUID function. It assigns
> function pointers too.
> 
> So why can't you assign an accept_memory() function pointer there and
> get rid of this sprinkled if (tdx) everywhere?

This code is called from the EFI stub, which runs before the decompressor
code and therefore before early_tdx_detect().

> > diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
> > index 918a7606f53c..8518a75e5dd5 100644
> > --- a/arch/x86/boot/compressed/tdx.c
> > +++ b/arch/x86/boot/compressed/tdx.c
> > @@ -3,12 +3,15 @@
> >  #include "../cpuflags.h"
> >  #include "../string.h"
> >  #include "../io.h"
> > +#include "align.h"
> >  #include "error.h"
> > +#include "pgtable_types.h"
> >  
> >  #include <vdso/limits.h>
> >  #include <uapi/asm/vmx.h>
> >  
> >  #include <asm/shared/tdx.h>
> > +#include <asm/page_types.h>
> >  
> >  /* Called from __tdx_hypercall() for unrecoverable failure */
> >  void __tdx_hypercall_failed(void)
> > @@ -75,3 +78,78 @@ void early_tdx_detect(void)
> >  	pio_ops.f_outb = tdx_outb;
> >  	pio_ops.f_outw = tdx_outw;
> >  }
> > +
> > +static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
> > +				    enum pg_level level)
> 
> That's pretty much a copy of the same function in arch/x86/coco/tdx/tdx.c.
> 
> Yeah, you need a tdx-shared.c which you include in both places just like
> it is done with sev-shared.c

Okay, will look into this.

> > +		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
> > +		if (!accept_size)
> > +			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
> > +		if (!accept_size)
> > +			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
> > +		if (!accept_size)
> > +			error("Accepting memory failed\n");
> > +		start += accept_size;
> 
> This series of calls to try_accept_one() appears in at least three
> places. Please carve them out into a separate function and put it in
> tdx-shared.c.

Okay.
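
Something along these lines, presumably - a minimal sketch, assuming the
helper lands in tdx-shared.c under a name like tdx_accept_memory(); the
name and exact signature are TBD:

	void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
	{
		while (start < end) {
			unsigned long len = end - start;
			unsigned long accept_size;

			/* Try the largest page size first, then fall back */
			accept_size = try_accept_one(start, len, PG_LEVEL_1G);
			if (!accept_size)
				accept_size = try_accept_one(start, len, PG_LEVEL_2M);
			if (!accept_size)
				accept_size = try_accept_one(start, len, PG_LEVEL_4K);
			if (!accept_size)
				error("Accepting memory failed\n");
			start += accept_size;
		}
	}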

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-08-09 11:36                                             ` Ard Biesheuvel
@ 2022-08-09 11:54                                               ` Kirill A. Shutemov
  2022-08-09 21:09                                                 ` Dionna Amalie Glaze
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-08-09 11:54 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Dave Hansen, Marc Orr, Borislav Petkov, Dionna Amalie Glaze,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

On Tue, Aug 09, 2022 at 01:36:00PM +0200, Ard Biesheuvel wrote:
> On Tue, 9 Aug 2022 at 13:11, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Sat, Jul 23, 2022 at 01:14:07PM +0200, Ard Biesheuvel wrote:
> > > On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
> > > >
> > > > On 7/19/22 17:26, Marc Orr wrote:
> > > > > - Dave's suggestion to "2. Boot some intermediate thing like a
> > > > > bootloader that does acceptance ..." is pretty clever! So if upstream
> > > > > thinks this FW-kernel negotiation is not a good direction, maybe we
> > > > > (Google) can pursue this idea to avoid introducing yet another tag on
> > > > > our images.
> > > >
> > > > I'm obviously speaking only for myself here and not for "upstream" as a
> > > > whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> > > > permanent pain in our necks to solve a very temporary problem.
> > >
> > > EFI is basically our existing embodiment of this fw/kernel negotiation
> > > thing, and iff we need it, I have no objection to using it for this
> > > purpose, i.e., to allow the firmware to infer whether or not it should
> > > accept all available memory on behalf of the OS before exiting boot
> > > services. But if we don't need this, even better.
> >
> > FW/kernel negotiation does not work if there's a boot loader in the middle
> > that does ExitBootServices(). By the time the kernel can announce whether
> > it supports unaccepted memory, there's nobody left to announce to.
> >
> 
> Why would you want to support such bootloaders for TDX anyway? TDX
> heavily relies on measured boot abstractions and other things that are
> heavily tied to firmware.

I don't understand it either. And, yet, there's demand for it.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory
  2022-08-09 11:54                                               ` Kirill A. Shutemov
@ 2022-08-09 21:09                                                 ` Dionna Amalie Glaze
  0 siblings, 0 replies; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-08-09 21:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, Dave Hansen, Marc Orr, Borislav Petkov,
	Peter Gonda, Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Yao, Jiewen

> > > > EFI is basically our existing embodiment of this fw/kernel negotiation
> > > > thing, and iff we need it, I have no objection to using it for this
> > > > purpose, i.e., to allow the firmware to infer whether or not it should
> > > > accept all available memory on behalf of the OS before exiting boot
> > > > services. But if we don't need this, even better.
> > >
> > > FW/kernel negotiation does not work if there's a boot loader in the middle
> > > that does ExitBootServices(). By the time the kernel can announce whether
> > > it supports unaccepted memory, there's nobody left to announce to.
> > >
> >
> > Why would you want to support such bootloaders for TDX anyway? TDX
> > heavily relies on measured boot abstractions and other things that are
> > heavily tied to firmware.
>
> I don't understand it either. And, yet, there's demand for it.
>

I think there's no good solution for the bad upgrade path that the
UEFI spec stuck us with, so I'm going to stick with what many folks
have suggested: just have the host require external information. What
this means is that at VM creation time, the user has to specify an
extra flag saying that all memory has to be accepted in firmware before
booting the guest OS. Failure to provide the flag leads to the
unfortunate outcome that the VM only has access to the lower 4GB of
RAM. We can only hope that the VM OOMs shortly after start-up and the
user reads an FAQ saying that they should add this flag.

I'll do a round of appeals to distributions to include this patch set
and AMD's follow-up that defines accept_memory for SEV-SNP, to shorten
the window during which people need to know about this flag.

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 14/14] x86/tdx: Add unaccepted memory support
  2022-08-09 11:45     ` Kirill A. Shutemov
@ 2022-08-10 10:27       ` Borislav Petkov
  0 siblings, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-08-10 10:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Tue, Aug 09, 2022 at 02:45:44PM +0300, Kirill A. Shutemov wrote:
> This code called from EFI stub which runs before decompressor code and
> therefore before early_tdx_detect().

Then pls call that function early_is_tdx_guest() and slap a comment
above it explaining why it needs to be a separate thing.
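
I.e., something along these lines - same code, new name, plus a comment
(the wording of the comment is only illustrative):

	/*
	 * This runs from the EFI stub, before the decompressor and
	 * therefore before early_tdx_detect() has set up any function
	 * pointers. It cannot reuse that path and has to do (and cache)
	 * its own CPUID-based detection.
	 */
	static bool early_is_tdx_guest(void)
	{
		static bool once;
		static bool is_tdx;

		if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
			return false;

		if (!once) {
			u32 eax, sig[3];

			cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
				    &sig[0], &sig[2], &sig[1]);
			is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
			once = true;
		}

		return is_tdx;
	}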

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-05 11:49   ` Vlastimil Babka
  2022-08-05 12:09     ` David Hildenbrand
@ 2022-08-10 14:19     ` Mel Gorman
  2022-08-15 21:08       ` Dionna Amalie Glaze
  1 sibling, 1 reply; 200+ messages in thread
From: Mel Gorman @ 2022-08-10 14:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Mike Rapoport

On Fri, Aug 05, 2022 at 01:49:41PM +0200, Vlastimil Babka wrote:
> On 6/14/22 14:02, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, require memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> > 
> > There are several ways the kernel can deal with unaccepted memory:
> > 
> >  1. Accept all the memory during the boot. It is easy to implement and
> >     it doesn't have runtime cost once the system is booted. The downside
> >     is very long boot time.
> > 
> >     Acceptance can be parallelized across multiple CPUs to keep it
> >     manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
> >     saturate memory bandwidth and does not scale beyond that point.
> > 
> >  2. Accept a block of memory on first use. It requires more
> >     infrastructure and changes in the page allocator to make it work,
> >     but it provides good boot time.
> > 
> >     On-demand memory accept means latency spikes every time the kernel
> >     steps onto a new memory block. The spikes will go away once the
> >     workload's data set size stabilizes or all memory gets accepted.
> > 
> >  3. Accept all memory in background. Introduce a thread (or multiple)
> >     that gets memory accepted proactively. It will minimize the time the
> >     system experiences latency spikes on memory allocation while keeping
> >     boot time low.
> > 
> >     This approach cannot function on its own. It is an extension of #2:
> >     background memory acceptance requires a functional scheduler, but the
> >     page allocator may need to tap into unaccepted memory before that.
> > 
> >     The downside of the approach is that these threads also steal CPU
> >     cycles and memory bandwidth from the user's workload and may hurt
> >     user experience.
> > 
> > Implement #2 for now. It is a reasonable default. Some workloads may
> > want to use #1 or #3 and they can be implemented later based on user's
> > demands.
> > 
> > Support of unaccepted memory requires a few changes in core-mm code:
> > 
> >   - memblock has to accept memory on allocation;
> > 
> >   - page allocator has to accept memory on the first allocation of the
> >     page;
> > 
> > Memblock change is trivial.
> > 
> > The page allocator is modified to accept pages on the first allocation.
> > The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
> > used to indicate that the page requires acceptance.
> > 
> > An architecture has to provide two helpers if it wants to support
> > unaccepted memory:
> > 
> >  - accept_memory() makes a range of physical addresses accepted.
> > 
> >  - range_contains_unaccepted_memory() checks whether anything within the
> >    range of physical addresses requires acceptance.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
> before, but it's really not great how this affects the core page allocator
> paths. Wouldn't it be possible to only release pages to the page allocator
> when accepted, and otherwise use some new per-zone variables together with
> the bitmap to track how much remains to be accepted and where? Then it
> could be hooked
> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
> accept them and continue. With a static key to flip in case we eventually
> accept everything. Because this is really similar scenario to the deferred
> init and that one was solved in a way that adds minimal overhead.
> 
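
For concreteness, the hook being suggested would look roughly like this
(the static key, the per-zone counter and try_to_accept_memory() are
made-up names here, not existing code):

	/* In get_page_from_freelist(), next to the deferred-init hook */
	if (!zone_watermark_fast(zone, order, mark, ac->highest_zoneidx,
				 alloc_flags, gfp_mask)) {
		if (static_branch_unlikely(&zones_with_unaccepted_pages) &&
		    zone->unaccepted_pages &&
		    try_to_accept_memory(zone, order))
			goto try_this_zone;

		/* ... existing watermark-failure handling ... */
	}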

I think it might be more straightforward to always accept pages in the
size of the pageblock. Smaller ranges should not matter because they have
been accepted in deferred_free_range. In expand, if PageUnaccepted is set
on a pageblock-sized page, take it off the list, drop the zone->lock
leaving IRQs disabled, accept the memory and reacquire the lock to split
the page into the required order.
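
Roughly - a sketch only, using accept_page() from earlier in the series
and glossing over the exact expand() plumbing:

	/* Page was just taken off the free list, zone->lock is held */
	if (PageUnaccepted(page)) {
		spin_unlock(&zone->lock);	/* IRQs stay disabled */
		accept_page(page, pageblock_order);
		__ClearPageUnaccepted(page);
		spin_lock(&zone->lock);
	}
	/* ... then split the pageblock down to the required order ... */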

IRQs being left disabled is unfortunate but even if the acceptance is
slow, it's presumably not so slow as to cause major problems. This would
reduce and probably eliminate the need to do the assert check in
accept_page. It might also simplify __free_one_page if it's known that a
pageblock range of pages is either all accepted or unaccepted.

Lastly, the default behaviour should probably be "accept all memory at
boot" and use Kconfig to allow acceptance to be deferred or overridden by
the command line. There are at least two reasons for this. Even though
this is a virtual machine, there still may be latency-sensitive
applications running early in boot using pinned vcpu->pcpu and no memory
overcommit. The unpredictable performance of the application early in
boot may be unacceptable and, without such an option, unavoidable. It
might take a long time but it could eventually generate bug reports about
"unpredictable performance early in boot" that will be hard to track down
unless accept_memory is observed using perf at the right time. Even when
that does happen, there will need to be an option to turn it off if the
unpredictable performance cannot be tolerated. Second, any benchmarking
done early in boot is likely to be disrupted, making the series a
potential bisection magnet that masks a performance bug elsewhere in the
merge window.
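
For example, something roughly like this - the Kconfig symbol and the
parameter values below are invented for illustration:

	static bool accept_all_memory_at_boot __ro_after_init =
		IS_ENABLED(CONFIG_UNACCEPTED_MEMORY_EAGER_ACCEPT);

	static int __init accept_memory_param(char *arg)
	{
		if (!arg)
			return -EINVAL;

		if (!strcmp(arg, "eager"))
			accept_all_memory_at_boot = true;
		else if (!strcmp(arg, "lazy"))
			accept_all_memory_at_boot = false;

		return 0;
	}
	early_param("accept_memory", accept_memory_param);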

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-03 14:02       ` Dave Hansen
@ 2022-08-11 11:26         ` Borislav Petkov
  2022-08-13 16:11           ` Andy Lutomirski
  2022-08-13 16:04         ` Andy Lutomirski
  1 sibling, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-08-11 11:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, x86, linux-mm, linux-coco, linux-efi, linux-kernel

On Wed, Aug 03, 2022 at 07:02:31AM -0700, Dave Hansen wrote:
> One other thing I remembered as I re-read my write up on this.
> 
> In the "new" mode, guests never get #VE's for unaccepted memory.  They
> just exit to the host and can never be reentered.  They must be killed.

Yeah, this is the part which I think is really silly.

OSes, in their execution lifetime, can - erroneously or not, and it
happens pretty often in real life - touch some unrelated memory. And this
has never been a big deal - #PF, that's it.

But now they don't even get a chance to correct their mistake - VMEXIT,
die.

load_unaligned_zeropad() is just one case.

Imagine the user loads some buggy driver in the guest and that driver
starts doing stray memory accesses through a wild pointer into the
fields. Guest dies immediately.

Dunno, but it all feels a bit too harsh and unfriendly to me.

Sure, if that user is really unlucky, those stray accesses can kill
his OS on baremetal too. So maybe you could argue here that such stray
accesses are actually a good thing. :)

All I know is, there should be a more resilient way to handle those.

> In the "old" mode, I _believe_ that the guest always gets a #VE for
> non-EPT-present memory.  The #VE is basically the same no matter if the
> page is unaccepted or if the host goes out and makes a
> previously-accepted page non-present.
> 
> One really nasty implication of this "old" mode is that the host can
> remove *accepted* pages that are used in the syscall gap.  That means
> that the #VE handler would need to be of the paranoid variety which
> opens up all kinds of other fun.

Yeah, I believe this needs to be dealt with anyway, for SNP at least.
But on AMD it would simply cause an exception and it'll be handled in
the #VC thing. And there's some ugly code to deal with the gap too.

>  * "Old" - #VE's can happen in the syscall gap
>  * "New" - #VE's happen at better-defined times.  Unexpected ones are
>    fatal.
> 
> There's a third option which I proposed but doesn't yet exist.  The TDX
> module _could_ separate the behavior of unaccepted memory #VE's and
> host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
> impunity and handle it in the #VE handler.  At the same time, the host
> would not be allowed to remove accepted memory and cause problems in the
> syscall gap.  Kinda the best of both worlds.

I like that. This should've been the default from the get-go. Oh well,
what's it called in English, hindsight is 20/20...?

> But, I'm not sure how valuable that would be now that we have the
> (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.

I think you should push for the bestest solution and one day we can kill
those ugly workarounds.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-08 17:16   ` [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
  2022-08-08 21:43     ` Dave Hansen
@ 2022-08-12 13:03     ` Borislav Petkov
  2022-08-12 14:11       ` Tom Lendacky
  1 sibling, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-08-12 13:03 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On Mon, Aug 08, 2022 at 12:16:24PM -0500, Tom Lendacky wrote:
> In advance of providing support for unaccepted memory, switch from using
> kmalloc() for allocating the Page State Change (PSC) structure to using a
> local variable that lives on the stack. This is needed to avoid a possible
> recursive call into set_pages_state() if the kmalloc() call requires
> (more) memory to be accepted, which would result in a hang.

I don't understand: kmalloc() allocates memory which is unaccepted?

> The current size of the PSC struct is 2,032 bytes. To make the struct more
> stack friendly, reduce the number of PSC entries from 253 down to 64,
> resulting in a size of 520 bytes. This is a nice compromise on struct size
> and total PSC requests.

Why can't you simply allocate that one PSC page once at boot, accept the
memory for it and use it throughout? Under locking, ofc, if multiple PSC
calls need to happen in parallel...

Instead of limiting the PSC req size.

> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
>  		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>  			snp_register_per_cpu_ghcb();
>  
> +		ghcb_percpu_ready = true;

You know how I can't stand those random boolean vars stating something
has been initialized?

Can't you at least use some field in struct ghcb.reserved_1[] or so
which the spec can provide to OS use so that FW doesn't touch it?

And then stick a "percpu_ready" bit there.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-12 13:03     ` Borislav Petkov
@ 2022-08-12 14:11       ` Tom Lendacky
  2022-08-12 14:33         ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-12 14:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra



On 8/12/22 08:03, Borislav Petkov wrote:
> On Mon, Aug 08, 2022 at 12:16:24PM -0500, Tom Lendacky wrote:
>> In advance of providing support for unaccepted memory, switch from using
>> kmalloc() for allocating the Page State Change (PSC) structure to using a
>> local variable that lives on the stack. This is needed to avoid a possible
>> recursive call into set_pages_state() if the kmalloc() call requires
>> (more) memory to be accepted, which would result in a hang.
> 
> I don't understand: kmalloc() allocates memory which is unaccepted?

In order to satisfy the kmalloc(), some memory has to be accepted. So it
tries to accept some additional memory, but we're already in the memory
acceptance path... deadlock.
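
Roughly:

  set_pages_state()
    kmalloc()                          <- needs a fresh page
      -> page allocator hands out an unaccepted page
        -> accept_memory()
          -> set_pages_state()         <- back where we started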

> 
>> The current size of the PSC struct is 2,032 bytes. To make the struct more
>> stack friendly, reduce the number of PSC entries from 253 down to 64,
>> resulting in a size of 520 bytes. This is a nice compromise on struct size
>> and total PSC requests.
> 
> Why can't you simply allocate that one PSC page once at boot, accept the
> memory for it and use it throughout? Under locking, ofc, if multiple PSC
> calls need to happen in parallel...
> 
> Instead of limiting the PSC req size.

There was a whole discussion on this and I would prefer to keep the 
ability to parallelize PSC without locking.

> 
>> @@ -1254,6 +1260,8 @@ void setup_ghcb(void)
>>   		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>>   			snp_register_per_cpu_ghcb();
>>   
>> +		ghcb_percpu_ready = true;
> 
> You know how I can't stand those random boolean vars stating something
> has been initialized?
> 
> Can't you at least use some field in struct ghcb.reserved_1[] or so
> which the spec can provide to OS use so that FW doesn't touch it?

Well, when we don't know which GHCB is in use, using that reserved area in
the GHCB doesn't help. Also, I don't want to update the GHCB specification 
for a single bit that is only required because of the way Linux went about 
establishing the GHCB usage.

Thanks,
Tom

> 
> And then stick a "percpu_ready" bit there.
> 
> Thx.
> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-12 14:11       ` Tom Lendacky
@ 2022-08-12 14:33         ` Borislav Petkov
  2022-08-12 14:51           ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-08-12 14:33 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
> There was a whole discussion on this

Pointer to it?

> and I would prefer to keep the ability to parallelize PSC without
> locking.

So smaller, on-stack PSC but lockless is still better than a bigger one
but with synchronized accesses to it?

> Well when we don't know which GHCB is in use, using that reserved area in
> the GHCB doesn't help.

What do you mean?

The one which you read with

	data = this_cpu_read(runtime_data);

in snp_register_per_cpu_ghcb() is the one you register.

> Also, I don't want to update the GHCB specification for a single bit
> that is only required because of the way Linux went about establishing
> the GHCB usage.

Linux?

You mean, you did it this way: 885689e47dfa1499b756a07237eb645234d93cf9

:-)

"The runtime handler needs one GHCB per-CPU. Set them up and map them
unencrypted."

Why does that handler need one GHCB per CPU?

As to the field, I was thinking along the lines of

	struct ghcb.vendor_flags

field which each virt vendor can use however they like.

It might be overkill but a random bool ain't pretty either. Especially
if those things start getting added for all kinds of other things.

If anything, you could make this a single u64 sev_flags which can at
least collect all that gunk in one variable...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-12 14:33         ` Borislav Petkov
@ 2022-08-12 14:51           ` Tom Lendacky
  2022-08-13 19:40             ` Borislav Petkov
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-12 14:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/12/22 09:33, Borislav Petkov wrote:
> On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
>> There was a whole discussion on this
> 
> Pointer to it?

It starts here: 
https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com/

> 
>> and I would prefer to keep the ability to parallelize PSC without
>> locking.
> 
> So smaller, on-stack PSC but lockless is still better than a bigger one
> but with synchronized accesses to it?
> 
>> Well when we don't know which GHCB is in use, using that reserved area in
>> the GHCB doesn't help.
> 
> What do you mean?
> 
> The one which you read with
> 
> 	data = this_cpu_read(runtime_data);

Memory acceptance is called before the per-CPU GHCBs have been allocated,
so you would actually be using the early boot GHCB. And which GHCB gets
used is decided based on the #VC handler that is invoked - but in this
case we're not coming through the #VC handler to accept memory.

> 
> in snp_register_per_cpu_ghcb() is the one you register.
> 
>> Also, I don't want to update the GHCB specification for a single bit
>> that is only required because of the way Linux went about establishing
>> the GHCB usage.
> 
> Linux?
> 
> You mean, you did it this way: 885689e47dfa1499b756a07237eb645234d93cf9
> 
> :-)

Well Joerg re-worked all that quite a bit. And with the SNP support, the
added requirement of registering the GHCB changed which GHCB could be
used. So even when the per-CPU GHCB is allocated, it can't be used until
it is registered, which depends on when the #VC handler is changed from
the boot #VC handler to the runtime #VC handler.

> 
> "The runtime handler needs one GHCB per-CPU. Set them up and map them
> unencrypted."
> 
> Why does that handler need one GHCB per CPU?

Each vCPU can be handling a #VC and you don't want to be serializing on a
single GHCB.

Thanks,
Tom

> 
> As to the field, I was thinking along the lines of
> 
> 	struct ghcb.vendor_flags
> 
> field which each virt vendor can use however they like.
> 
> It might be overkill but a random bool ain't pretty either. Especially
> if those things start getting added for all kinds of other things.
> 
> If anything, you could make this a single u64 sev_flags which can at
> least collect all that gunk in one variable ... at least...
> 
> Thx.
> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-09 11:38     ` Kirill A. Shutemov
@ 2022-08-13 16:03       ` Andy Lutomirski
  2022-08-13 21:02         ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Andy Lutomirski @ 2022-08-13 16:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Sathyanarayanan Kuppuswamy, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand,
	Marcelo Henrique Cerri, tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List



On Tue, Aug 9, 2022, at 4:38 AM, Kirill A. Shutemov wrote:
> On Tue, Jul 26, 2022 at 01:17:13PM -0700, Andy Lutomirski wrote:
>> 
>> 
>> On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
>> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
>> > The unwanted loads are typically harmless. But, they might be made to
>> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
>> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
>> > unwanted loads.
>> >
>> > But, this approach does not work for unaccepted memory. For TDX, a load
>> > from unaccepted memory will not lead to a recoverable exception within
>> > the guest. The guest will exit to the VMM where the only recourse is to
>> > terminate the guest.
>> 
>> Why is unaccepted memory marked present in the direct map in the first place?
>> 
>> Having kernel code assume that every valid address is followed by
>> several bytes of memory that may be read without side effects other than
>> #PF also seems like a mistake, but I probably won’t win that fight. But
>> sticking guard pages in front of definitely-not-logically present pages
>> seems silly to me.  Let’s just not map it.
>
> It would mean no 1G pages in the direct mapping for TDX as we accept 2M
> at a time.
>
>> (What if MMIO memory is mapped next to regular memory?  Doing random
>> unaligned reads that cross into MMIO seems unwise.)
>
> MMIO is shared, not unaccepted private memory. We already handle that
> situation. See 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad()
> page-cross to a shared page").
>

I don’t mean in a confidential guest — I mean generally. This whole model of “overrun the buffer — no big deal” is just fragile.

> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-03 14:02       ` Dave Hansen
  2022-08-11 11:26         ` Borislav Petkov
@ 2022-08-13 16:04         ` Andy Lutomirski
  2022-08-13 20:58           ` Kirill A. Shutemov
  1 sibling, 1 reply; 200+ messages in thread
From: Andy Lutomirski @ 2022-08-13 16:04 UTC (permalink / raw)
  To: Dave Hansen, Borislav Petkov, Kirill A. Shutemov
  Cc: Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Sathyanarayanan Kuppuswamy, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Henrique Cerri,
	tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List



On Wed, Aug 3, 2022, at 7:02 AM, Dave Hansen wrote:
> On 8/2/22 16:46, Dave Hansen wrote:
>> To sum it all up, I'm not happy with the complexity of the page
>> acceptance code either but I'm not sure that it's bad tradeoff compared
>> to greater #VE complexity or fragility.
>> 
>> Does anyone think we should go back and really reconsider this?
>
> One other thing I remembered as I re-read my write up on this.
>
> In the "new" mode, guests never get #VE's for unaccepted memory.  They
> just exit to the host and can never be reentered.  They must be killed.
>
> In the "old" mode, I _believe_ that the guest always gets a #VE for
> non-EPT-present memory.  The #VE is basically the same no matter if the
> page is unaccepted or if the host goes out and makes a
> previously-accepted page non-present.
>
> One really nasty implication of this "old" mode is that the host can
> remove *accepted* pages that are used in the syscall gap.  That means
> that the #VE handler would need to be of the paranoid variety which
> opens up all kinds of other fun.
>
>  * "Old" - #VE's can happen in the syscall gap
>  * "New" - #VE's happen at better-defined times.  Unexpected ones are
>    fatal.
>
> There's a third option which I proposed but doesn't yet exist.  The TDX
> module _could_ separate the behavior of unaccepted memory #VE's and
> host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
> impunity and handle it in the #VE handler.  At the same time, the host
> would not be allowed to remove accepted memory and cause problems in the
> syscall gap.  Kinda the best of both worlds.
>
> But, I'm not sure how valuable that would be now that we have the
> (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.

How would that be implemented?  It would need to track which GPAs *were* accepted across a host-induced unmap/remap cycle. This would involve preventing the host from ever completely removing a secure EPT table without the guest’s help, right?

Admittedly this would IMO be better behavior. Is it practical to implement?

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-11 11:26         ` Borislav Petkov
@ 2022-08-13 16:11           ` Andy Lutomirski
  2022-08-13 21:13             ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Andy Lutomirski @ 2022-08-13 16:11 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen
  Cc: Kirill A. Shutemov, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Sathyanarayanan Kuppuswamy, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Henrique Cerri,
	tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List



On Thu, Aug 11, 2022, at 4:26 AM, Borislav Petkov wrote:
> On Wed, Aug 03, 2022 at 07:02:31AM -0700, Dave Hansen wrote:
>> One other thing I remembered as I re-read my write up on this.
>> 
>> In the "new" mode, guests never get #VE's for unaccepted memory.  They
>> just exit to the host and can never be reentered.  They must be killed.
>
> Yeah, this is the part which I think is really silly.
>
> OSes, in their execution lifetime, can - erroneously or not - but it
> happens pretty often in real life, touch some unrelated memory. And this
> has never been a big deal - #PF, that's it.
>
> But now they don't even get a chance to correct their mistake - VMEXIT,
> die.
>
> load_unaligned_zeropad() is just one case.
>
> Imagine the user loads some buggy driver in the guest and that driver
> starts doing stray memory accesses through a wild pointer into the
> fields. Guest dies immediately.
>
> Dunno bit it all feels a bit too harsh and unfriendly to me.
>
> Sure, if that user is really unlucky, those stray accesses can kill
> his OS on baremetal too. So maybe you could argue here that such stray
> accesses are actually a good thing. :)
>
> All I know is, there should be a more resilient way to handle those.
>
>> In the "old" mode, I _believe_ that the guest always gets a #VE for
>> non-EPT-present memory.  The #VE is basically the same no matter if the
>> page is unaccepted or if the host goes out and makes a
>> previously-accepted page non-present.
>> 
>> One really nasty implication of this "old" mode is that the host can
>> remove *accepted* pages that are used in the syscall gap.  That means
>> that the #VE handler would need to be of the paranoid variety which
>> opens up all kinds of other fun.
>
> Yeah, I believe this needs to be dealt with anyway, for SNP at least.
> But on AMD it would simply cause an exception and it'll be handled in
> the #VC thing. And there's some ugly code to deal with the gap too.

I do not believe for a second that the “ugly” code in question is correct. Let’s please not go there for TDX.  The whole point of this thing is security — I would rather see a somewhat fragile situation than an exploit.

Now if the TD module could deliver an unrecoverable #MC instead of an impossible-to-handle #VE, maybe we could at least get a nice debug trace out?  Of course it’s not so easy to do anything with a debug trace that doesn’t break confidentiality.

So, Dave (and other Intel folks), here’s a different feature I think we will want: a secure way to get debug logs out of a TDX guest. Maybe the guest could have a way to send syslog lines to the TD module (via hypercall or a ring buffer) and the TD module could encrypt them against a secret that is only accessible to future boots of the same guest or to an attested outside key.  TDX already has an exceedingly flexible attestation mechanism — there is surely room for the guest config to include a public key that is used for encryption of exported logs.

(Yes, this is sort of doable in software, but the part where any kind of buffer actually gets saved anywhere if the guest goes kaboom requires either help from the TD module or a defined protocol with key negotiation involving the untrusted host.)

The model where the guest leaves a panic on the screen or serial console won’t work in TDX…

>
>>  * "Old" - #VE's can happen in the syscall gap
>>  * "New" - #VE's happen at better-defined times.  Unexpected ones are
>>    fatal.
>> 
>> There's a third option which I proposed but doesn't yet exist.  The TDX
>> module _could_ separate the behavior of unaccepted memory #VE's and
>> host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
>> impunity and handle it in the #VE handler.  At the same time, the host
>> would not be allowed to remove accepted memory and cause problems in the
>> syscall gap.  Kinda the best of both worlds.
>
> I like that. This should've been the default from the get-go. Oh well,
> what's it called in English, hindsight is 20 20...?
>
>> But, I'm not sure how valuable that would be now that we have the
>> (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.
>
> I think you should push for the bestest solution and one day we can kill
> those ugly workarounds.
>
> Thx.
>
> -- 
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-12 14:51           ` Tom Lendacky
@ 2022-08-13 19:40             ` Borislav Petkov
  2022-08-14 13:36               ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-08-13 19:40 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On Fri, Aug 12, 2022 at 09:51:41AM -0500, Tom Lendacky wrote:
> On 8/12/22 09:33, Borislav Petkov wrote:
> > On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
> > > There was a whole discussion on this
> > 
> > Pointer to it?
> 
> It starts here: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com/

So how come none of the rationale for the on-stack decision vs a single
buffer with spinlock protection has made it into this patch?

We need to have the reason why this thing is changed documented
somewhere.

> > So smaller, on-stack PSC but lockless is still better than a bigger one
> > but with synchronized accesses to it?

That thing.

That decision for an on-stack buffer needs explaining.

> > > Well when we don't know which GHCB is in use, using that reserved area in
> > > the GHCB doesn't help.
> > 
> > What do you mean?
> > 
> > The one which you read with
> > 
> > 	data = this_cpu_read(runtime_data);
> 
> Memory acceptance is called before the per-CPU GHCBs have been allocated,
> so you would actually be using the early boot GHCB. And which GHCB gets
> used is decided based on the #VC handler that is invoked - but in this
> case we're not coming through the #VC handler to accept memory.

But then ghcb_percpu_ready needs to be a per-CPU variable too! Because
it is set right after snp_register_per_cpu_ghcb() which works on the
*per-CPU* GHCB.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-13 16:04         ` Andy Lutomirski
@ 2022-08-13 20:58           ` Kirill A. Shutemov
  0 siblings, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-08-13 20:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Borislav Petkov, Kirill A. Shutemov,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Sathyanarayanan Kuppuswamy, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Henrique Cerri,
	tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List

On Sat, Aug 13, 2022 at 09:04:58AM -0700, Andy Lutomirski wrote:
> 
> 
> On Wed, Aug 3, 2022, at 7:02 AM, Dave Hansen wrote:
> > On 8/2/22 16:46, Dave Hansen wrote:
> >> To sum it all up, I'm not happy with the complexity of the page
> >> acceptance code either but I'm not sure that it's bad tradeoff compared
> >> to greater #VE complexity or fragility.
> >> 
> >> Does anyone think we should go back and really reconsider this?
> >
> > One other thing I remembered as I re-read my write up on this.
> >
> > In the "new" mode, guests never get #VE's for unaccepted memory.  They
> > just exit to the host and can never be reentered.  They must be killed.
> >
> > In the "old" mode, I _believe_ that the guest always gets a #VE for
> > non-EPT-present memory.  The #VE is basically the same no matter if the
> > page is unaccepted or if the host goes out and makes a
> > previously-accepted page non-present.
> >
> > One really nasty implication of this "old" mode is that the host can
> > remove *accepted* pages that are used in the syscall gap.  That means
> > that the #VE handler would need to be of the paranoid variety which
> > opens up all kinds of other fun.
> >
> >  * "Old" - #VE's can happen in the syscall gap
> >  * "New" - #VE's happen at better-defined times.  Unexpected ones are
> >    fatal.
> >
> > There's a third option which I proposed but doesn't yet exist.  The TDX
> > module _could_ separate the behavior of unaccepted memory #VE's and
> > host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
> > impunity and handle it in the #VE handler.  At the same time, the host
> > would not be allowed to remove accepted memory and cause problems in the
> > syscall gap.  Kinda the best of both worlds.
> >
> > But, I'm not sure how valuable that would be now that we have the
> > (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.
> 
> How would that be implemented?  It would need to track which GPAs *were*
> accepted across a host-induced unmap/remap cycle. This would involve
> preventing the host from ever completely removing a secure EPT table
> without the guest’s help, right?
> 
> Admittedly this would IMO be better behavior. Is it practical to implement?

I don't think it is better if you look at it from the host's POV. The
host owns the resources of the machine and has to have a way to pull
memory from an uncooperative TD at any point.

It would also require a more complicated private->shared conversion, as
the guest would have to notify the TDX module about the change, not only
the host as we do now.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-13 16:03       ` Andy Lutomirski
@ 2022-08-13 21:02         ` Kirill A. Shutemov
  0 siblings, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-08-13 21:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Borislav Petkov, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Sathyanarayanan Kuppuswamy, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand,
	Marcelo Henrique Cerri, tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List

On Sat, Aug 13, 2022 at 09:03:13AM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Aug 9, 2022, at 4:38 AM, Kirill A. Shutemov wrote:
> > On Tue, Jul 26, 2022 at 01:17:13PM -0700, Andy Lutomirski wrote:
> >> 
> >> 
> >> On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
> >> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> >> > The unwanted loads are typically harmless. But, they might be made to
> >> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> >> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> >> > unwanted loads.
> >> >
> >> > But, this approach does not work for unaccepted memory. For TDX, a load
> >> > from unaccepted memory will not lead to a recoverable exception within
> >> > the guest. The guest will exit to the VMM where the only recourse is to
> >> > terminate the guest.
> >> 
> >> Why is unaccepted memory marked present in the direct map in the first place?
> >> 
> >> Having kernel code assume that every valid address is followed by
> >> several bytes of memory that may be read without side effects other than
> >> #PF also seems like a mistake, but I probably won’t win that fight. But
> >> sticking guard pages in front of definitely-not-logically present pages
> >> seems silly to me.  Let’s just not map it.
> >
> > It would mean no 1G pages in the direct mapping for TDX as we accept 2M
> > at a time.

As of now, we don't have a way to recover the direct mapping from
fragmentation. So once we split a 1G page into 2M pages, it stays that
way.

> >> (What if MMIO memory is mapped next to regular memory?  Doing random
> >> unaligned reads that cross into MMIO seems unwise.)
> >
> > MMIO is shared, not unaccepted private memory. We already handle that
> > situation. See 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad()
> > page-cross to a shared page").
> >
> 
> I don’t mean in a confidential guest — I mean generally. This whole
> model of “overrun the buffer — no big deal” is just fragile.

If you want to remove load_unaligned_zeropad(), I would not object. It
can make life easier.

I presumed that the optimization it brings has a measurable benefit
(otherwise, why bother).

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-08-13 16:11           ` Andy Lutomirski
@ 2022-08-13 21:13             ` Kirill A. Shutemov
  0 siblings, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-08-13 21:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Kirill A. Shutemov,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Sathyanarayanan Kuppuswamy, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Henrique Cerri,
	tim.gardner, khalid.elmously, philip.cox,
	the arch/x86 maintainers, linux-mm, linux-coco, linux-efi,
	Linux Kernel Mailing List

On Sat, Aug 13, 2022 at 09:11:52AM -0700, Andy Lutomirski wrote:
> Now if the TD module could deliver an unrecoverable #MC instead of an
> impossible-to-handle #VE, maybe we could at least get a nice debug trace
> out?  Of course it’s not so easy to do anything with a debug trace that
> doesn’t break confidentiality.

It is not an impossible-to-handle #VE: there is no #VE for the guest,
just an exit to the host that cannot be recovered from. Yes, it is not
friendly for debugging.

Our plan was to allow SEPT_VE_DISABLE=0 for a debug TD. It helps with
debugging stepping on unaccepted memory, as it allows a #VE in the guest,
which leads to panic() and a nice traceback.

Would it be enough?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-13 19:40             ` Borislav Petkov
@ 2022-08-14 13:36               ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-14 13:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/13/22 14:40, Borislav Petkov wrote:
> On Fri, Aug 12, 2022 at 09:51:41AM -0500, Tom Lendacky wrote:
>> On 8/12/22 09:33, Borislav Petkov wrote:
>>> On Fri, Aug 12, 2022 at 09:11:25AM -0500, Tom Lendacky wrote:
>>>> There was a whole discussion on this
>>>
>>> Pointer to it?
>>
>> It starts here: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com/
> 
> So how come none of the rationale for the on-stack decision vs a single
> buffer with spinlock protection has made it into this patch?
> 
> We need to have the reason why this thing is changed documented
> somewhere.

Yup, was all being addressed in v3 based on Dave's comments.

> 
>>> So smaller, on-stack PSC but lockless is still better than a bigger one
>>> but with synchronized accesses to it?
> 
> That thing.
> 
> That decision for on-stack buffer needs explaining why.
> 
>>>> Well when we don't know which GHCB is in use, using that reserved area in
>>>> the GHCB doesn't help.
>>>
>>> What do you mean?
>>>
>>> The one which you read with
>>>
>>> 	data = this_cpu_read(runtime_data);
>>
>> Memory acceptance is called before the per-CPU GHCBs have been allocated
>> and so you would be actually be using early boot GHCB. And that is decided
>> based on the #VC handler that is invoked - but in this case we're not
>> coming through the #VC handler to accept memory.
> 
> But then ghcb_percpu_ready needs to be a per-CPU variable too! Because
> it is set right after snp_register_per_cpu_ghcb() which works on the
> *per-CPU* GHCB.

No, and the code comment will explain this. Since the APs only ever use 
the per-CPU GHCB there is no concern as to when there is a switch over 
from the early boot GHCB to the per-CPU GHCB, so a single global variable 
is all that is needed.

I'll send out v3 soon.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v3 0/2] Provide SEV-SNP support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2022-08-08 17:16 ` [PATCH v2 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-08-15 15:57 ` Tom Lendacky
  2022-08-15 15:57   ` [PATCH v3 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
  2022-08-15 15:57   ` [PATCH v3 2/2] x86/sev: Add SNP-specific " Tom Lendacky
  2022-08-25 14:23 ` [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  19 siblings, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-15 15:57 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

This series adds SEV-SNP support for unaccepted memory to the patch series
titled:

  [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory

Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This led to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. So this series consists of two patches:

  - A pre-patch to switch from a kmalloc()'d page state change structure
    to a (smaller) stack-based page state change structure.

  - SNP support for unaccepted memory.

The series is based on and tested against Kirill Shutemov's tree:
  https://github.com/intel/tdx.git guest-unaccepted-memory

---

Changes since v2:
- Improve code comments in regards to when to use the per-CPU GHCB vs
  the MSR protocol and why a single global value is valid for both
  the BSP and APs.
- Add a comment related to the number of PSC entries and how it can
  impact the size of the struct and, therefore, stack usage.
- Add a WARN_ON_ONCE() for invoking vmgexit_psc() when per-CPU GHCBs
  haven't been created or registered, yet.
- Use the compiler support for clearing the PSC struct instead of
  issuing memset().

Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
  structure.


Tom Lendacky (2):
  x86/sev: Put PSC struct on the stack in prep for unaccepted memory
    support
  x86/sev: Add SNP-specific unaccepted memory support

 arch/x86/Kconfig                  |  1 +
 arch/x86/boot/compressed/mem.c    |  3 ++
 arch/x86/boot/compressed/sev.c    | 10 ++++++-
 arch/x86/boot/compressed/sev.h    | 23 +++++++++++++++
 arch/x86/include/asm/sev-common.h |  9 ++++--
 arch/x86/include/asm/sev.h        |  3 ++
 arch/x86/kernel/sev.c             | 48 +++++++++++++++++++++++++------
 arch/x86/mm/unaccepted_memory.c   |  4 +++
 8 files changed, 90 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

-- 
2.36.1


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v3 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-15 15:57 ` [PATCH v3 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-08-15 15:57   ` Tom Lendacky
  2022-08-17 16:08     ` Borislav Petkov
  2022-08-15 15:57   ` [PATCH v3 2/2] x86/sev: Add SNP-specific " Tom Lendacky
  1 sibling, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-15 15:57 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.

The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.

Also, since set_pages_state() uses the per-CPU GHCB, add a static variable
that indicates when per-CPU GHCBs are available. Until they are available,
the GHCB MSR protocol is used to perform page state changes. For APs, the
per-CPU GHCB is created before they are started and registered upon
startup, so this flag can be used globally for the BSP and APs instead of
creating a per-CPU flag.

If either the reduction in PSC entries or the use of the MSR protocol
until the per-CPU GHCBs are available results in any kind of performance
issue (none has been seen at the moment), the use of a larger static PSC
struct with a fallback to the smaller stack version, or the use of the
boot GHCB prior to the per-CPU GHCBs, can be investigated.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/include/asm/sev-common.h |  9 +++++++--
 arch/x86/kernel/sev.c             | 32 +++++++++++++++++++++++--------
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..6c3d61c5f6a3 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -106,8 +106,13 @@ enum psc_op {
 #define GHCB_HV_FT_SNP			BIT_ULL(0)
 #define GHCB_HV_FT_SNP_AP_CREATION	BIT_ULL(1)
 
-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY		253
+/*
+ * SNP Page State Change NAE event
+ *   The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure,
+ *   which is a local variable (stack usage) in set_pages_state(). Do not
+ *   increase this value without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY		64
 
 struct psc_hdr {
 	u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..40268ce97aad 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,6 +66,17 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  */
 static struct ghcb *boot_ghcb __section(".data");
 
+/*
+ * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
+ * been created and registered and thus can be used instead of using the MSR
+ * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
+ * which only works with a per-CPU GHCB.
+ *
+ * For APs, the per-CPU GHCB is created before they are started and registered
+ * upon startup, so this flag can be used globally for the BSP and APs.
+ */
+static bool ghcb_percpu_ready __section(".data");
+
 /* Bitmap of SEV features supported by the hypervisor */
 static u64 sev_hv_features __ro_after_init;
 
@@ -660,7 +671,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
 	}
 }
 
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
@@ -751,6 +762,8 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
 	unsigned long flags;
 	struct ghcb *ghcb;
 
+	WARN_ON_ONCE(!ghcb_percpu_ready);
+
 	/*
 	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
 	 * a per-CPU GHCB.
@@ -868,11 +881,14 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
 	unsigned long vaddr_end, next_vaddr;
-	struct snp_psc_desc *desc;
+	struct snp_psc_desc desc = {};
 
-	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
-	if (!desc)
-		panic("SNP: failed to allocate memory for PSC descriptor\n");
+	/*
+	 * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
+	 * since vmgexit_psc() uses the per-CPU GHCB.
+	 */
+	if (!ghcb_percpu_ready)
+		return early_set_pages_state(__pa(vaddr), npages, op);
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +898,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 		next_vaddr = min_t(unsigned long, vaddr_end,
 				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
 
-		__set_pages_state(desc, vaddr, next_vaddr, op);
+		__set_pages_state(&desc, vaddr, next_vaddr, op);
 
 		vaddr = next_vaddr;
 	}
-
-	kfree(desc);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -1254,6 +1268,8 @@ void setup_ghcb(void)
 		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 			snp_register_per_cpu_ghcb();
 
+		ghcb_percpu_ready = true;
+
 		return;
 	}
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v3 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-15 15:57 ` [PATCH v3 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-08-15 15:57   ` [PATCH v3 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
@ 2022-08-15 15:57   ` Tom Lendacky
  2022-08-18 13:39     ` Borislav Petkov
  1 sibling, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-15 15:57 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.
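
Condensed into a sketch, the accept path is the two calls that the diff
below adds in snp_accept_memory():

  /* Accept the pages in [start, end): state change, then validate */
  set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
  pvalidate_pages(vaddr, npages, true);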

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/boot/compressed/mem.c  |  3 +++
 arch/x86/boot/compressed/sev.c  | 10 +++++++++-
 arch/x86/boot/compressed/sev.h  | 23 +++++++++++++++++++++++
 arch/x86/include/asm/sev.h      |  3 +++
 arch/x86/kernel/sev.c           | 16 ++++++++++++++++
 arch/x86/mm/unaccepted_memory.c |  4 ++++
 7 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
+	select UNACCEPTED_MEMORY
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
 #include "find.h"
 #include "math.h"
 #include "tdx.h"
+#include "sev.h"
 #include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 	/* Platform-specific memory-acceptance call goes here */
 	if (is_tdx_guest())
 		tdx_accept_memory(start, end);
+	else if (sev_snp_enabled())
+		snp_accept_memory(start, end);
 	else
 		error("Cannot accept memory: unknown platform\n");
 }
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..d4b06c862094 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 /* Include code for early handlers */
 #include "../../kernel/sev-shared.c"
 
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
 {
 	return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
 }
@@ -161,6 +161,14 @@ void snp_set_page_shared(unsigned long paddr)
 	__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	while (end > start) {
+		snp_set_page_private(start);
+		start += PAGE_SIZE;
+	}
+}
+
 static bool early_setup_ghcb(void)
 {
 	if (set_page_decrypted((unsigned long)&boot_ghcb_page))
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..21db66bacefe 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -202,6 +202,7 @@ void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -226,6 +227,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
 	return -ENOTTY;
 }
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
 #endif
 
 #endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 40268ce97aad..d71740f54277 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -924,6 +924,22 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 	pvalidate_pages(vaddr, npages, true);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long vaddr;
+	unsigned int npages;
+
+	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+		return;
+
+	vaddr = (unsigned long)__va(start);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+
+	pvalidate_pages(vaddr, npages, true);
+}
+
 static int snp_set_vmsa(void *va, bool vmsa)
 {
 	u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
 #include <asm/setup.h>
 #include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
 
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
 			tdx_accept_memory(range_start * PMD_SIZE,
 					  range_end * PMD_SIZE);
+		} else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+			snp_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
 		} else {
 			panic("Cannot accept memory: unknown platform\n");
 		}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-10 14:19     ` Mel Gorman
@ 2022-08-15 21:08       ` Dionna Amalie Glaze
  2022-08-15 22:02         ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-08-15 21:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vlastimil Babka, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Tom Lendacky,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Dario Faggioli, Dave Hansen, Mike Rapoport, David Hildenbrand,
	Marcelo Cerri, tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Mike Rapoport

>
>
> The unpredictable performance of the application early in boot may be
> unacceptable and unavoidable. It might take a long time but it could
> eventually generate bug reports about "unpredictable performance early
> in boot" that will be hard to track down unless accept_memory is observed
> using perf at the right time. Even when that does happen, there will need
> to be an option to turn it off if the unpredictable performance cannot
> be tolerated. Second, any benchmarking done early in boot is likely to
> be disrupted making the series a potential bisection magnet that masks a
> performance bug elsewhere in the merge window.

I'm doing some boot performance tests now before I run some workload
memory acceptance latency tests.
Note that this testing is on AMD SEV-SNP, so it uses this patch series
on top of the AMD guest patches v12, plus a patch Brijesh Singh wrote
to define __accept_memory for SEV-SNP:
https://github.com/AMDESE/linux/commit/ecae2582666d50ce1e633975d703d2f904183ece

I was getting pretty consistent boot times, only going up slightly as
the memory size increased, but at 256GB, the VM crashes because it
touches some unaccepted memory without first accepting it. 255GB boots
fine.

The stack trace is in mm/page_alloc.c. I've done a little
investigation, but I can't account for why there's a hard cutoff of
correctness at 256GB.

[    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
[    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
ff ff
[    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
0000000000000404
[    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
[    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
[    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
[    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
[    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
[    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
knlGS:0000000000000000
[    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
[    0.078209] Call Trace:
[    0.078524]  <TASK>
[    0.078887]  ? free_area_init+0x5c1/0x66c
[    0.079417]  ? zone_sizes_init+0x52/0x6c
[    0.079934]  ? setup_arch+0xa55/0xb6d
[    0.080417]  ? start_kernel+0x64/0x65a
[    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
[    0.081620]  </TASK>

>
> --
> Mel Gorman
> SUSE Labs



-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-15 21:08       ` Dionna Amalie Glaze
@ 2022-08-15 22:02         ` Tom Lendacky
  2022-08-29 16:02           ` Dionna Amalie Glaze
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-15 22:02 UTC (permalink / raw)
  To: Dionna Amalie Glaze, Mel Gorman
  Cc: Vlastimil Babka, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Mike Rapoport

On 8/15/22 16:08, Dionna Amalie Glaze wrote:
>>
>>
>> The unpredictable performance of the application early in boot may be
>> unacceptable and unavoidable. It might take a long time but it could
>> eventually generate bug reports about "unpredictable performance early
>> in boot" that will be hard to track down unless accept_memory is observed
>> using perf at the right time. Even when that does happen, there will need
>> to be an option to turn it off if the unpredictable performance cannot
>> be tolerated. Second, any benchmarking done early in boot is likely to
>> be disrupted making the series a potential bisection magnet that masks a
>> performance bug elsewhere in the merge window.
> 
> I'm doing some boot performance tests now before I run some workload
> memory acceptance latency tests.
> Note that this testing is on AMD SEV-SNP, so it uses this patch series
> on top of the AMD guest patches v12, plus a patch Brijesh Singh wrote
> to define __accept_memory for SEV-SNP:
> https://github.com/AMDESE/linux/commit/ecae2582666d50ce1e633975d703d2f904183ece

Note that there is a bug in Brijesh's version of the patch and it will 
almost exclusively use the MSR protocol. Please try the version of the 
patch that I recently sent up based on the current unaccepted memory tree 
from Kirill.

https://lore.kernel.org/lkml/cover.1660579062.git.thomas.lendacky@amd.com/

Thanks,
Tom

> 
> I was getting pretty consistent boot times, only going up slightly as
> the memory size increased, but at 256GB, the VM crashes because it
> touches some unaccepted memory without first accepting it. 255GB boots
> fine.
> 
> The stack trace is in mm/page_alloc.c. I've done a little
> investigation, but I can't account for why there's a hard cutoff of
> correctness at 256GB.
> 
> [    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
> [    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
> 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
> 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
> ff ff
> [    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
> 0000000000000404
> [    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
> [    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
> [    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
> [    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
> [    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
> [    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
> knlGS:0000000000000000
> [    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
> [    0.078209] Call Trace:
> [    0.078524]  <TASK>
> [    0.078887]  ? free_area_init+0x5c1/0x66c
> [    0.079417]  ? zone_sizes_init+0x52/0x6c
> [    0.079934]  ? setup_arch+0xa55/0xb6d
> [    0.080417]  ? start_kernel+0x64/0x65a
> [    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
> [    0.081620]  </TASK>
> 
>>
>> --
>> Mel Gorman
>> SUSE Labs
> 
> 
> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v3 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-15 15:57   ` [PATCH v3 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
@ 2022-08-17 16:08     ` Borislav Petkov
  2022-08-17 21:17       ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Borislav Petkov @ 2022-08-17 16:08 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On Mon, Aug 15, 2022 at 10:57:42AM -0500, Tom Lendacky wrote:
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index c05f0124c410..40268ce97aad 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,6 +66,17 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>   */
>  static struct ghcb *boot_ghcb __section(".data");
>  
> +/*
> + * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
> + * been created and registered and thus can be used instead of using the MSR
> + * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
> + * which only works with a per-CPU GHCB.
> + *
> + * For APs, the per-CPU GHCB is created before they are started and registered
> + * upon startup, so this flag can be used globally for the BSP and APs.
> + */

Ok, better, thanks!

> +static bool ghcb_percpu_ready __section(".data");

However, it reads really weird if you have "percpu" in the name of a
variable which is not per CPU...

Let's just call it "ghcbs_initialized" and be done with it.

And I still hate the whole thing ofc.

Do this ontop (and I knew we had a flags thing already):

(And yes, __read_mostly is in the .data section too).

---
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 40268ce97aad..5b3afbf26349 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -66,17 +66,6 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  */
 static struct ghcb *boot_ghcb __section(".data");
 
-/*
- * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
- * been created and registered and thus can be used instead of using the MSR
- * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
- * which only works with a per-CPU GHCB.
- *
- * For APs, the per-CPU GHCB is created before they are started and registered
- * upon startup, so this flag can be used globally for the BSP and APs.
- */
-static bool ghcb_percpu_ready __section(".data");
-
 /* Bitmap of SEV features supported by the hypervisor */
 static u64 sev_hv_features __ro_after_init;
 
@@ -128,7 +117,18 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
 
 struct sev_config {
 	__u64 debug		: 1,
-	      __reserved	: 63;
+
+	      /*
+	       * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
+	       * been created and registered and thus can be used instead of using the MSR
+	       * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
+	       * which only works with a per-CPU GHCB.
+	       *
+	       * For APs, the per-CPU GHCB is created before they are started and registered
+	       * upon startup, so this flag can be used globally for the BSP and APs.
+	       */
+	      ghcbs_initialized : 1,
+	      __reserved	: 62;
 };
 
 static struct sev_config sev_cfg __read_mostly;
@@ -762,7 +762,7 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
 	unsigned long flags;
 	struct ghcb *ghcb;
 
-	WARN_ON_ONCE(!ghcb_percpu_ready);
+	WARN_ON_ONCE(!sev_cfg.ghcbs_initialized);
 
 	/*
 	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
@@ -887,7 +887,7 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 	 * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
 	 * since vmgexit_psc() uses the per-CPU GHCB.
 	 */
-	if (!ghcb_percpu_ready)
+	if (!sev_cfg.ghcbs_initialized)
 		return early_set_pages_state(__pa(vaddr), npages, op);
 
 	vaddr = vaddr & PAGE_MASK;
@@ -1268,7 +1268,7 @@ void setup_ghcb(void)
 		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 			snp_register_per_cpu_ghcb();
 
-		ghcb_percpu_ready = true;
+		sev_cfg.ghcbs_initialized = true;
 
 		return;
 	}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCH v3 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-17 16:08     ` Borislav Petkov
@ 2022-08-17 21:17       ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-17 21:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 8/17/22 11:08, Borislav Petkov wrote:
> On Mon, Aug 15, 2022 at 10:57:42AM -0500, Tom Lendacky wrote:
>> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
>> index c05f0124c410..40268ce97aad 100644
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -66,6 +66,17 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>>    */
>>   static struct ghcb *boot_ghcb __section(".data");
>>   
>> +/*
>> + * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
>> + * been created and registered and thus can be used instead of using the MSR
>> + * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
>> + * which only works with a per-CPU GHCB.
>> + *
>> + * For APs, the per-CPU GHCB is created before they are started and registered
>> + * upon startup, so this flag can be used globally for the BSP and APs.
>> + */
> 
> Ok, better, thanks!
> 
>> +static bool ghcb_percpu_ready __section(".data");
> 
> However, it reads really weird if you have "percpu" in the name of a
> variable which is not per CPU...
> 
> Let's just call it "ghcbs_initialized" and be done with it.
> 
> And I still hate the whole thing ofc.
> 
> Do this ontop (and I knew we had a flags thing already):
> 
> (And yes, __read_mostly is in the .data section too).

Cool, will do.

Thanks,
Tom

> 
> ---
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 40268ce97aad..5b3afbf26349 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -66,17 +66,6 @@ static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>    */
>   static struct ghcb *boot_ghcb __section(".data");
>   
> -/*
> - * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
> - * been created and registered and thus can be used instead of using the MSR
> - * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
> - * which only works with a per-CPU GHCB.
> - *
> - * For APs, the per-CPU GHCB is created before they are started and registered
> - * upon startup, so this flag can be used globally for the BSP and APs.
> - */
> -static bool ghcb_percpu_ready __section(".data");
> -
>   /* Bitmap of SEV features supported by the hypervisor */
>   static u64 sev_hv_features __ro_after_init;
>   
> @@ -128,7 +117,18 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
>   
>   struct sev_config {
>   	__u64 debug		: 1,
> -	      __reserved	: 63;
> +
> +	      /*
> +	       * A flag used by set_pages_state() that indicates when the per-CPU GHCB has
> +	       * been created and registered and thus can be used instead of using the MSR
> +	       * protocol. The set_pages_state() function eventually invokes vmgexit_psc(),
> +	       * which only works with a per-CPU GHCB.
> +	       *
> +	       * For APs, the per-CPU GHCB is created before they are started and registered
> +	       * upon startup, so this flag can be used globally for the BSP and APs.
> +	       */
> +	      ghcbs_initialized : 1,
> +	      __reserved	: 62;
>   };
>   
>   static struct sev_config sev_cfg __read_mostly;
> @@ -762,7 +762,7 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
>   	unsigned long flags;
>   	struct ghcb *ghcb;
>   
> -	WARN_ON_ONCE(!ghcb_percpu_ready);
> +	WARN_ON_ONCE(!sev_cfg.ghcbs_initialized);
>   
>   	/*
>   	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
> @@ -887,7 +887,7 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
>   	 * Use the MSR protocol when the per-CPU GHCBs are not yet registered,
>   	 * since vmgexit_psc() uses the per-CPU GHCB.
>   	 */
> -	if (!ghcb_percpu_ready)
> +	if (!sev_cfg.ghcbs_initialized)
>   		return early_set_pages_state(__pa(vaddr), npages, op);
>   
>   	vaddr = vaddr & PAGE_MASK;
> @@ -1268,7 +1268,7 @@ void setup_ghcb(void)
>   		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>   			snp_register_per_cpu_ghcb();
>   
> -		ghcb_percpu_ready = true;
> +		sev_cfg.ghcbs_initialized = true;
>   
>   		return;
>   	}
> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v3 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-15 15:57   ` [PATCH v3 2/2] x86/sev: Add SNP-specific " Tom Lendacky
@ 2022-08-18 13:39     ` Borislav Petkov
  0 siblings, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-08-18 13:39 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On Mon, Aug 15, 2022 at 10:57:43AM -0500, Tom Lendacky wrote:
> Add SNP-specific hooks to the unaccepted memory support in the boot
> path (__accept_memory()) and the core kernel (accept_memory()) in order
> to support booting SNP guests when unaccepted memory is present. Without
> this support, SNP guests will fail to boot and/or panic() when unaccepted
> memory is present in the EFI memory map.
> 
> The process of accepting memory under SNP involves invoking the hypervisor
> to perform a page state change for the page to private memory and then
> issuing a PVALIDATE instruction to accept the page.
> 
> Create the new header file arch/x86/boot/compressed/sev.h because adding
> the function declaration to any of the existing SEV related header files
> pulls in too many other header files, causing the build to fail.
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
>  arch/x86/Kconfig                |  1 +
>  arch/x86/boot/compressed/mem.c  |  3 +++
>  arch/x86/boot/compressed/sev.c  | 10 +++++++++-
>  arch/x86/boot/compressed/sev.h  | 23 +++++++++++++++++++++++
>  arch/x86/include/asm/sev.h      |  3 +++
>  arch/x86/kernel/sev.c           | 16 ++++++++++++++++
>  arch/x86/mm/unaccepted_memory.c |  4 ++++
>  7 files changed, 59 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/boot/compressed/sev.h

Looks mostly ok to me...

> diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> index 730c4677e9db..d4b06c862094 100644
> --- a/arch/x86/boot/compressed/sev.c
> +++ b/arch/x86/boot/compressed/sev.c
> @@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
>  /* Include code for early handlers */
>  #include "../../kernel/sev-shared.c"
>  
> -static inline bool sev_snp_enabled(void)
> +bool sev_snp_enabled(void)
>  {
>  	return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
>  }

This is another one of my pet peeves and now it even gets exported but
it is the early decompressor crap so I won't even try to mention cc_*
helpers...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-07-29 14:01   ` [PATCH v1 " Tom Lendacky
@ 2022-08-23  0:24     ` Dionna Amalie Glaze
  2022-08-23 14:28       ` Tom Lendacky
  2022-08-23 23:28     ` Dionna Amalie Glaze
  1 sibling, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-08-23  0:24 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: LKML, the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Kirill A. Shutemov, H. Peter Anvin,
	Michael Roth, Joerg Roedel, Andy Lutomirski, Peter Zijlstra

>
> +void snp_accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +       unsigned long vaddr;
> +       unsigned int npages;
> +
> +       if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
> +               return;
> +
> +       vaddr = (unsigned long)__va(start);
> +       npages = (end - start) >> PAGE_SHIFT;
> +
> +       set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
> +
> +       pvalidate_pages(vaddr, npages, true);
> +}

My testing of this patch shows that a significant amount of time is
spent using the MSR protocol to change page state, so much so that
it's slower than eagerly accepting all
memory. The difference gets worse as the RAM size goes up, so I think
there's some phase problem with the GHCB protocol not getting used
early enough?
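
As a rough sense of scale (assuming one 4K page per MSR-protocol exit
versus up to VMGEXIT_PSC_MAX_ENTRY (64) pages per GHCB PSC request),
accepting 4GB takes ~1M VMGEXITs over the MSR protocol but only ~16K
over the GHCB, so any fallback to the MSR path dominates quickly.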

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-23  0:24     ` Dionna Amalie Glaze
@ 2022-08-23 14:28       ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-23 14:28 UTC (permalink / raw)
  To: Dionna Amalie Glaze
  Cc: LKML, the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Kirill A. Shutemov, H. Peter Anvin,
	Michael Roth, Joerg Roedel, Andy Lutomirski, Peter Zijlstra

On 8/22/22 19:24, Dionna Amalie Glaze wrote:
>>
>> +void snp_accept_memory(phys_addr_t start, phys_addr_t end)
>> +{
>> +       unsigned long vaddr;
>> +       unsigned int npages;
>> +
>> +       if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
>> +               return;
>> +
>> +       vaddr = (unsigned long)__va(start);
>> +       npages = (end - start) >> PAGE_SHIFT;
>> +
>> +       set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
>> +
>> +       pvalidate_pages(vaddr, npages, true);
>> +}
> 
> My testing of this patch shows that a significant amount of time is
> spent using the MSR protocol to change page state, in such a
> significant fashion that it's slower than eagerly accepting all
> memory. The difference gets worse as the RAM size goes up, so I think
> there's some phase problem with the GHCB protocol not getting used
> early enough?

Thank you for testing. Let me see what I can find. I might have to rework 
Brijesh's original patches more to make use of the early boot GHCB in 
order to cut down on the number of MSR protocol requests.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v1 2/2] x86/sev: Add SNP-specific unaccepted memory support
  2022-07-29 14:01   ` [PATCH v1 " Tom Lendacky
  2022-08-23  0:24     ` Dionna Amalie Glaze
@ 2022-08-23 23:28     ` Dionna Amalie Glaze
  1 sibling, 0 replies; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-08-23 23:28 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: LKML, the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Kirill A. Shutemov, H. Peter Anvin,
	Michael Roth, Joerg Roedel, Andy Lutomirski, Peter Zijlstra

> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
>         select INSTRUCTION_DECODER
>         select ARCH_HAS_CC_PLATFORM
>         select X86_MEM_ENCRYPT
> +       select UNACCEPTED_MEMORY
>         help
>           Say yes to enable support for the encryption of system memory.
>           This requires an AMD processor that supports Secure Memory

At the risk of starting another centithread like on Kirill's patches
for unaccepted memory, I think this needs to be brought up.

By making unaccepted memory an option rather than a dependency, we get
into an inescapable situation of always needing to know, from within
the firmware, whether or not the guest OS will support unaccepted memory.
I think that makes a UEFI specification change necessary.
If we don't make this configurable, and indeed make it a dependency,
then we can say SEV-SNP implies that the firmware should create
unaccepted memory. We can work around the short gap of support between
kernel versions.

What are your thoughts on dependency versus UEFI spec change to allow
this configuration to be negotiated with the firmware?

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2022-08-15 15:57 ` [PATCH v3 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-08-25 14:23 ` Tom Lendacky
  2022-08-25 14:23   ` [PATCH v4 1/4] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
                     ` (3 more replies)
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  19 siblings, 4 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-25 14:23 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

This series adds SEV-SNP support for unaccepted memory to the patch series
titled:

  [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory

Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. Additionally, the page state change operations are not
optimized under Linux since it was expected that all memory had been
validated already, resulting in poor performance when adding basic
support for unaccepted memory.

 So this series consists of four patches:

  - A pre-patch to switch from a kmalloc()'d page state change structure
    to a (smaller) stack-based page state change structure.

  - A pre-patch to allow the use of the early boot GHCB in the core kernel
    path.

  - A pre-patch to allow for use of 2M page state change requests and 2M
    page validation.

  - SNP support for unaccepted memory.

The series is based off of and tested against Kirill Shutemov's tree:
  https://github.com/intel/tdx.git guest-unaccepted-memory

---

Changes since v3:
- Reworks the PSC process to greatly improve performance:
  - Optimize the PSC process to use 2M pages when applicable.
  - Optimize the page validation process to use 2M pages when applicable.
  - Use the early GHCB in both the decompression phase and core kernel
    boot phase in order to minimize the use of the MSR protocol. The MSR
    protocol only allows for a single 4K page to be updated at a time.
- Move the ghcb_percpu_ready flag into the sev_config structure and
  rename it to ghcbs_initialized.

Changes since v2:
- Improve code comments in regards to when to use the per-CPU GHCB vs
  the MSR protocol and why a single global value is valid for both
  the BSP and APs.
- Add a comment related to the number of PSC entries and how it can
  impact the size of the struct and, therefore, stack usage.
- Add a WARN_ON_ONCE() for invoking vmgexit_psc() when per-CPU GHCBs
  haven't been created or registered, yet.
- Use the compiler support for clearing the PSC struct instead of
  issuing memset().

Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
  structure.


Tom Lendacky (4):
  x86/sev: Put PSC struct on the stack in prep for unaccepted memory
    support
  x86/sev: Allow for use of the early boot GHCB for PSC requests
  x86/sev: Use large PSC requests if applicable
  x86/sev: Add SNP-specific unaccepted memory support

 arch/x86/Kconfig                  |   1 +
 arch/x86/boot/compressed/mem.c    |   3 +
 arch/x86/boot/compressed/sev.c    |  54 ++++++-
 arch/x86/boot/compressed/sev.h    |  23 +++
 arch/x86/include/asm/sev-common.h |   9 +-
 arch/x86/include/asm/sev.h        |   7 +
 arch/x86/kernel/sev-shared.c      | 104 +++++++++++++
 arch/x86/kernel/sev.c             | 246 +++++++++++++-----------------
 arch/x86/mm/unaccepted_memory.c   |   4 +
 9 files changed, 305 insertions(+), 146 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

-- 
2.37.2


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v4 1/4] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-25 14:23 ` [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-08-25 14:23   ` Tom Lendacky
  2022-09-20 16:15     ` Borislav Petkov
  2022-08-25 14:23   ` [PATCH v4 2/4] x86/sev: Allow for use of the early boot GHCB for PSC requests Tom Lendacky
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-25 14:23 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.

The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.

If the reduction in PSC entries results in any kind of performance issue
(none has been seen at the moment), use of a larger static PSC struct,
with fallback to the smaller stack version, can be investigated.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/include/asm/sev-common.h |  9 +++++++--
 arch/x86/kernel/sev.c             | 10 ++--------
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..6c3d61c5f6a3 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -106,8 +106,13 @@ enum psc_op {
 #define GHCB_HV_FT_SNP			BIT_ULL(0)
 #define GHCB_HV_FT_SNP_AP_CREATION	BIT_ULL(1)
 
-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY		253
+/*
+ * SNP Page State Change NAE event
+ *   The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure,
+ *   which is a local variable (stack usage) in set_pages_state(). Do not
+ *   increase this value without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY		64
 
 struct psc_hdr {
 	u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..d18a580dd048 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -868,11 +868,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
 	unsigned long vaddr_end, next_vaddr;
-	struct snp_psc_desc *desc;
-
-	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
-	if (!desc)
-		panic("SNP: failed to allocate memory for PSC descriptor\n");
+	struct snp_psc_desc desc;
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -882,12 +878,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 		next_vaddr = min_t(unsigned long, vaddr_end,
 				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
 
-		__set_pages_state(desc, vaddr, next_vaddr, op);
+		__set_pages_state(&desc, vaddr, next_vaddr, op);
 
 		vaddr = next_vaddr;
 	}
-
-	kfree(desc);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v4 2/4] x86/sev: Allow for use of the early boot GHCB for PSC requests
  2022-08-25 14:23 ` [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-08-25 14:23   ` [PATCH v4 1/4] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
@ 2022-08-25 14:23   ` Tom Lendacky
  2022-08-25 14:23   ` [PATCH v4 3/4] x86/sev: Use large PSC requests if applicable Tom Lendacky
  2022-08-25 14:23   ` [PATCH v4 4/4] x86/sev: Add SNP-specific unaccepted memory support Tom Lendacky
  3 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-25 14:23 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Using a GHCB for a page state change (as opposed to the MSR protocol)
allows for multiple pages to be processed in a single request. In prep
for early PSC requests in support of unaccepted memory, update the
invocation of vmgexit_psc() to be able to use the early boot GHCB and not
just the per-CPU GHCB structure.

In order to use the proper GHCB (early boot vs per-CPU), set a flag that
indicates when the per-CPU GHCBs are available and registered. For APs,
the per-CPU GHCBs are created before they are started and registered upon
startup, so this flag can be used globally for the BSP and APs instead of
creating a per-CPU flag. This will allow for a significant reduction in
the number of MSR protocol page state change requests when accepting
memory.
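
The resulting GHCB selection in __set_pages_state() boils down to this
condensed view of the diff below:

  if (sev_cfg.ghcbs_initialized)
          ghcb = __sev_get_ghcb(&state);   /* per-CPU GHCB, IRQs disabled */
  else
          ghcb = boot_ghcb;                /* early boot GHCB */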

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/kernel/sev.c | 61 +++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index d18a580dd048..a5f02b6b099b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -117,7 +117,19 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
 
 struct sev_config {
 	__u64 debug		: 1,
-	      __reserved	: 63;
+
+	      /*
+	       * A flag used by __set_pages_state() that indicates when the
+	       * per-CPU GHCB has been created and registered and thus can be
+	       * used by the BSP instead of the early boot GHCB.
+	       *
+	       * For APs, the per-CPU GHCB is created before they are started
+	       * and registered upon startup, so this flag can be used globally
+	       * for the BSP and APs.
+	       */
+	      ghcbs_initialized	: 1,
+
+	      __reserved	: 62;
 };
 
 static struct sev_config sev_cfg __read_mostly;
@@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
 	}
 }
 
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
@@ -742,26 +754,13 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
 		WARN(1, "invalid memory op %d\n", op);
 }
 
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
 {
 	int cur_entry, end_entry, ret = 0;
 	struct snp_psc_desc *data;
-	struct ghcb_state state;
 	struct es_em_ctxt ctxt;
-	unsigned long flags;
-	struct ghcb *ghcb;
 
-	/*
-	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
-	 * a per-CPU GHCB.
-	 */
-	local_irq_save(flags);
-
-	ghcb = __sev_get_ghcb(&state);
-	if (!ghcb) {
-		ret = 1;
-		goto out_unlock;
-	}
+	vc_ghcb_invalidate(ghcb);
 
 	/* Copy the input desc into GHCB shared buffer */
 	data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -818,20 +817,18 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
 	}
 
 out:
-	__sev_put_ghcb(&state);
-
-out_unlock:
-	local_irq_restore(flags);
-
 	return ret;
 }
 
 static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 			      unsigned long vaddr_end, int op)
 {
+	struct ghcb_state state;
 	struct psc_hdr *hdr;
 	struct psc_entry *e;
+	unsigned long flags;
 	unsigned long pfn;
+	struct ghcb *ghcb;
 	int i;
 
 	hdr = &data->hdr;
@@ -861,8 +858,20 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 		i++;
 	}
 
-	if (vmgexit_psc(data))
+	local_irq_save(flags);
+
+	if (sev_cfg.ghcbs_initialized)
+		ghcb = __sev_get_ghcb(&state);
+	else
+		ghcb = boot_ghcb;
+
+	if (!ghcb || vmgexit_psc(ghcb, data))
 		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+	if (sev_cfg.ghcbs_initialized)
+		__sev_put_ghcb(&state);
+
+	local_irq_restore(flags);
 }
 
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
@@ -870,6 +879,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 	unsigned long vaddr_end, next_vaddr;
 	struct snp_psc_desc desc;
 
+	/* Use the MSR protocol when a GHCB is not available. */
+	if (!boot_ghcb)
+		return early_set_pages_state(__pa(vaddr), npages, op);
+
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
 
@@ -1248,6 +1261,8 @@ void setup_ghcb(void)
 		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 			snp_register_per_cpu_ghcb();
 
+		sev_cfg.ghcbs_initialized = true;
+
 		return;
 	}
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v4 3/4] x86/sev: Use large PSC requests if applicable
  2022-08-25 14:23 ` [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-08-25 14:23   ` [PATCH v4 1/4] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
  2022-08-25 14:23   ` [PATCH v4 2/4] x86/sev: Allow for use of the early boot GHCB for PSC requests Tom Lendacky
@ 2022-08-25 14:23   ` Tom Lendacky
  2022-08-25 14:23   ` [PATCH v4 4/4] x86/sev: Add SNP-specific unaccepted memory support Tom Lendacky
  3 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-25 14:23 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, request 2M Page
State Change (PSC) requests when the address range allows for it. By using
a 2M page size, more PSC operations can be handled in a single request to
the hypervisor. The hypervisor will determine if it can accommodate the
larger request by checking the mapping in the nested page table. If mapped
as a large page, then the 2M page request can be performed, otherwise the
2M page request will be broken down into 512 4K page requests. This is
still more efficient than having the guest perform multiple PSC requests
in order to process the 512 4K pages.

In conjunction with the 2M PSC requests, attempt to perform the associated
PVALIDATE instruction of the page using the 2M page size. If PVALIDATE
fails with a size mismatch, then fallback to validating 512 4K pages. To
do this, page validation is modified to work with the PSC structure and
not just a virtual address range.
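
In terms of request density (using the 64-entry limit from patch 1):
a PSC request filled with 4K entries covers at most 64 * 4K = 256K,
while 2M entries raise that to 64 * 2M = 128M per request, a 512x
improvement whenever the address range allows it.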

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/include/asm/sev.h |   4 ++
 arch/x86/kernel/sev.c      | 125 ++++++++++++++++++++++++-------------
 2 files changed, 84 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..0007ab04ac5f 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -79,11 +79,15 @@ extern void vc_no_ghcb(void);
 extern void vc_boot_ghcb(void);
 extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 
+/* PVALIDATE return codes */
+#define PVALIDATE_FAIL_SIZEMISMATCH	6
+
 /* Software defined (when rFlags.CF = 1) */
 #define PVALIDATE_FAIL_NOUPDATE		255
 
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
+#define RMP_PG_SIZE_2M			1
 
 #define RMPADJUST_VMSA_PAGE_BIT		BIT(16)
 
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a5f02b6b099b..a744f7f2e72b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,32 +655,58 @@ static u64 __init get_jump_table_addr(void)
 	return ret;
 }
 
-static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool validate)
+static void pvalidate_pages(struct snp_psc_desc *desc)
 {
-	unsigned long vaddr_end;
+	struct psc_entry *e;
+	unsigned long vaddr;
+	unsigned int size;
+	unsigned int i;
+	bool validate;
 	int rc;
 
-	vaddr = vaddr & PAGE_MASK;
-	vaddr_end = vaddr + (npages << PAGE_SHIFT);
+	for (i = 0; i <= desc->hdr.end_entry; i++) {
+		e = &desc->entries[i];
+
+		vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+		size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+		validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+		rc = pvalidate(vaddr, size, validate);
+		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+			unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+			for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+				rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+				if (rc)
+					break;
+			}
+		}
 
-	while (vaddr < vaddr_end) {
-		rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
 		if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
 			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-
-		vaddr = vaddr + PAGE_SIZE;
 	}
 }
 
-static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+				  unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
+	int ret;
+
+	vaddr = vaddr & PAGE_MASK;
 
 	paddr = paddr & PAGE_MASK;
 	paddr_end = paddr + (npages << PAGE_SHIFT);
 
 	while (paddr < paddr_end) {
+		if (op == SNP_PAGE_STATE_SHARED) {
+			/* Page validation must be rescinded before changing to shared */
+			ret = pvalidate(vaddr, RMP_PG_SIZE_4K, false);
+			if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+				goto e_term;
+		}
+
 		/*
 		 * Use the MSR protocol because this function can be called before
 		 * the GHCB is established.
@@ -701,7 +727,15 @@ static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum
 			 paddr, GHCB_MSR_PSC_RESP_VAL(val)))
 			goto e_term;
 
-		paddr = paddr + PAGE_SIZE;
+		if (op == SNP_PAGE_STATE_PRIVATE) {
+			/* Page validation must be performed after changing to private */
+			ret = pvalidate(vaddr, RMP_PG_SIZE_4K, true);
+			if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+				goto e_term;
+		}
+
+		vaddr += PAGE_SIZE;
+		paddr += PAGE_SIZE;
 	}
 
 	return;
@@ -720,10 +754,7 @@ void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long padd
 	  * Ask the hypervisor to mark the memory pages as private in the RMP
 	  * table.
 	  */
-	early_set_pages_state(paddr, npages, SNP_PAGE_STATE_PRIVATE);
-
-	/* Validate the memory pages after they've been added in the RMP table. */
-	pvalidate_pages(vaddr, npages, true);
+	early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_PRIVATE);
 }
 
 void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
@@ -732,11 +763,8 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
 	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 		return;
 
-	/* Invalidate the memory pages before they are marked shared in the RMP table. */
-	pvalidate_pages(vaddr, npages, false);
-
 	 /* Ask hypervisor to mark the memory pages shared in the RMP table. */
-	early_set_pages_state(paddr, npages, SNP_PAGE_STATE_SHARED);
+	early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
 }
 
 void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op)
@@ -820,10 +848,11 @@ static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
 	return ret;
 }
 
-static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
-			      unsigned long vaddr_end, int op)
+static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
+				       unsigned long vaddr_end, int op)
 {
 	struct ghcb_state state;
+	bool use_large_entry;
 	struct psc_hdr *hdr;
 	struct psc_entry *e;
 	unsigned long flags;
@@ -837,27 +866,37 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 	memset(data, 0, sizeof(*data));
 	i = 0;
 
-	while (vaddr < vaddr_end) {
-		if (is_vmalloc_addr((void *)vaddr))
+	while (vaddr < vaddr_end && i < ARRAY_SIZE(data->entries)) {
+		hdr->end_entry = i;
+
+		if (is_vmalloc_addr((void *)vaddr)) {
 			pfn = vmalloc_to_pfn((void *)vaddr);
-		else
+			use_large_entry = false;
+		} else {
 			pfn = __pa(vaddr) >> PAGE_SHIFT;
+			use_large_entry = true;
+		}
 
 		e->gfn = pfn;
 		e->operation = op;
-		hdr->end_entry = i;
 
-		/*
-		 * Current SNP implementation doesn't keep track of the RMP page
-		 * size so use 4K for simplicity.
-		 */
-		e->pagesize = RMP_PG_SIZE_4K;
+		if (use_large_entry && IS_ALIGNED(vaddr, PMD_PAGE_SIZE) &&
+		    (vaddr_end - vaddr) >= PMD_PAGE_SIZE) {
+			e->pagesize = RMP_PG_SIZE_2M;
+			vaddr += PMD_PAGE_SIZE;
+		} else {
+			e->pagesize = RMP_PG_SIZE_4K;
+			vaddr += PAGE_SIZE;
+		}
 
-		vaddr = vaddr + PAGE_SIZE;
 		e++;
 		i++;
 	}
 
+	/* Page validation must be rescinded before changing to shared */
+	if (op == SNP_PAGE_STATE_SHARED)
+		pvalidate_pages(data);
+
 	local_irq_save(flags);
 
 	if (sev_cfg.ghcbs_initialized)
@@ -865,6 +904,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 	else
 		ghcb = boot_ghcb;
 
+	/* Invoke the hypervisor to perform the page state changes */
 	if (!ghcb || vmgexit_psc(ghcb, data))
 		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
 
@@ -872,29 +912,28 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 		__sev_put_ghcb(&state);
 
 	local_irq_restore(flags);
+
+	/* Page validation must be performed after changing to private */
+	if (op == SNP_PAGE_STATE_PRIVATE)
+		pvalidate_pages(data);
+
+	return vaddr;
 }
 
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
-	unsigned long vaddr_end, next_vaddr;
 	struct snp_psc_desc desc;
+	unsigned long vaddr_end;
 
 	/* Use the MSR protocol when a GHCB is not available. */
 	if (!boot_ghcb)
-		return early_set_pages_state(__pa(vaddr), npages, op);
+		return early_set_pages_state(vaddr, __pa(vaddr), npages, op);
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
 
-	while (vaddr < vaddr_end) {
-		/* Calculate the last vaddr that fits in one struct snp_psc_desc. */
-		next_vaddr = min_t(unsigned long, vaddr_end,
-				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
-
-		__set_pages_state(&desc, vaddr, next_vaddr, op);
-
-		vaddr = next_vaddr;
-	}
+	while (vaddr < vaddr_end)
+		vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -902,8 +941,6 @@ void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
 	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 		return;
 
-	pvalidate_pages(vaddr, npages, false);
-
 	set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
 }
 
@@ -913,8 +950,6 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 		return;
 
 	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
-
-	pvalidate_pages(vaddr, npages, true);
 }
 
 static int snp_set_vmsa(void *va, bool vmsa)
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v4 4/4] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-25 14:23 ` [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory Tom Lendacky
                     ` (2 preceding siblings ...)
  2022-08-25 14:23   ` [PATCH v4 3/4] x86/sev: Use large PSC requests if applicable Tom Lendacky
@ 2022-08-25 14:23   ` Tom Lendacky
  2022-08-25 22:10     ` Dionna Amalie Glaze
  3 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-08-25 14:23 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.
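
In rough pseudo-C, that two-step flow for a single 4K page looks like the
sketch below. It is built only from helpers this series already uses
(vmgexit_psc(), pvalidate(), sev_es_terminate()); it is not the actual
patch, which batches entries and handles 2M pages:

static void snp_accept_one_page(unsigned long vaddr, phys_addr_t pa)
{
	struct snp_psc_desc desc = {};

	/* Step 1: ask the hypervisor to change the page state to private. */
	desc.hdr.end_entry = 0;
	desc.entries[0].gfn = pa >> PAGE_SHIFT;
	desc.entries[0].operation = SNP_PAGE_STATE_PRIVATE;
	desc.entries[0].pagesize = RMP_PG_SIZE_4K;
	if (vmgexit_psc(boot_ghcb, &desc))
		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);

	/* Step 2: PVALIDATE the page; this order must not be reversed. */
	if (pvalidate(vaddr, RMP_PG_SIZE_4K, true))
		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
}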

Since the boot path and the core kernel paths perform similar operations,
move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c
to avoid code duplication.

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/Kconfig                |   1 +
 arch/x86/boot/compressed/mem.c  |   3 +
 arch/x86/boot/compressed/sev.c  |  54 ++++++++++++++-
 arch/x86/boot/compressed/sev.h  |  23 +++++++
 arch/x86/include/asm/sev.h      |   3 +
 arch/x86/kernel/sev-shared.c    | 104 +++++++++++++++++++++++++++++
 arch/x86/kernel/sev.c           | 112 ++++----------------------------
 arch/x86/mm/unaccepted_memory.c |   4 ++
 8 files changed, 205 insertions(+), 99 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
+	select UNACCEPTED_MEMORY
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
 #include "find.h"
 #include "math.h"
 #include "tdx.h"
+#include "sev.h"
 #include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 	/* Platform-specific memory-acceptance call goes here */
 	if (is_tdx_guest())
 		tdx_accept_memory(start, end);
+	else if (sev_snp_enabled())
+		snp_accept_memory(start, end);
 	else
 		error("Cannot accept memory: unknown platform\n");
 }
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..22da65c96b47 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 /* Include code for early handlers */
 #include "../../kernel/sev-shared.c"
 
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
 {
 	return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
 }
@@ -181,6 +181,58 @@ static bool early_setup_ghcb(void)
 	return true;
 }
 
+static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
+				       phys_addr_t pa, phys_addr_t pa_end)
+{
+	struct psc_hdr *hdr;
+	struct psc_entry *e;
+	unsigned int i;
+
+	hdr = &desc->hdr;
+	memset(hdr, 0, sizeof(*hdr));
+
+	e = desc->entries;
+
+	i = 0;
+	while (pa < pa_end && i < VMGEXIT_PSC_MAX_ENTRY) {
+		hdr->end_entry = i;
+
+		e->gfn = pa >> PAGE_SHIFT;
+		e->operation = SNP_PAGE_STATE_PRIVATE;
+		if (IS_ALIGNED(pa, PMD_PAGE_SIZE) && (pa_end - pa) >= PMD_PAGE_SIZE) {
+			e->pagesize = RMP_PG_SIZE_2M;
+			pa += PMD_PAGE_SIZE;
+		} else {
+			e->pagesize = RMP_PG_SIZE_4K;
+			pa += PAGE_SIZE;
+		}
+
+		e++;
+		i++;
+	}
+
+	if (vmgexit_psc(boot_ghcb, desc))
+		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+	pvalidate_pages(desc);
+
+	return pa;
+}
+
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct snp_psc_desc desc = {};
+	unsigned int i;
+	phys_addr_t pa;
+
+	if (!boot_ghcb && !early_setup_ghcb())
+		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+	pa = start;
+	while (pa < end)
+		pa = __snp_accept_memory(&desc, pa, end);
+}
+
 void sev_es_shutdown_ghcb(void)
 {
 	if (!boot_ghcb)
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0007ab04ac5f..9297aab0c79e 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -206,6 +206,7 @@ void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -230,6 +231,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
 	return -ENOTTY;
 }
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
 #endif
 
 #endif
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index b478edf43bec..7ac7857da2b8 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -12,6 +12,9 @@
 #ifndef __BOOT_COMPRESSED
 #define error(v)	pr_err(v)
 #define has_cpuflag(f)	boot_cpu_has(f)
+#else
+#undef WARN
+#define WARN(condition...)
 #endif
 
 /* I/O parameters for CPUID-related helpers */
@@ -998,3 +1001,104 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
 			cpuid_ext_range_max = fn->eax;
 	}
 }
+
+static void pvalidate_pages(struct snp_psc_desc *desc)
+{
+	struct psc_entry *e;
+	unsigned long vaddr;
+	unsigned int size;
+	unsigned int i;
+	bool validate;
+	int rc;
+
+	for (i = 0; i <= desc->hdr.end_entry; i++) {
+		e = &desc->entries[i];
+
+		vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+		size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+		validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+		rc = pvalidate(vaddr, size, validate);
+		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+			unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+			for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+				rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+				if (rc)
+					break;
+			}
+		}
+
+		if (rc) {
+			WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
+		}
+	}
+}
+
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
+{
+	int cur_entry, end_entry, ret = 0;
+	struct snp_psc_desc *data;
+	struct es_em_ctxt ctxt;
+
+	vc_ghcb_invalidate(ghcb);
+
+	/* Copy the input desc into GHCB shared buffer */
+	data = (struct snp_psc_desc *)ghcb->shared_buffer;
+	memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
+
+	/*
+	 * As per the GHCB specification, the hypervisor can resume the guest
+	 * before processing all the entries. Check whether all the entries
+	 * are processed. If not, then keep retrying. Note, the hypervisor
+	 * will update the data memory directly to indicate the status, so
+	 * reference the data->hdr everywhere.
+	 *
+	 * The strategy here is to wait for the hypervisor to change the page
+	 * state in the RMP table before guest accesses the memory pages. If the
+	 * page state change was not successful, then later memory access will
+	 * result in a crash.
+	 */
+	cur_entry = data->hdr.cur_entry;
+	end_entry = data->hdr.end_entry;
+
+	while (data->hdr.cur_entry <= data->hdr.end_entry) {
+		ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
+
+		/* This will advance the shared buffer that data points to. */
+		ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
+
+		/*
+		 * Page State Change VMGEXIT can pass error code through
+		 * exit_info_2.
+		 */
+		if (ret || ghcb->save.sw_exit_info_2) {
+			WARN(1, "SNP: PSC failed ret=%d exit_info_2=%llx\n",
+			     ret, ghcb->save.sw_exit_info_2);
+			ret = 1;
+			goto out;
+		}
+
+		/* Verify that reserved bit is not set */
+		if (data->hdr.reserved) {
+			WARN(1, "Reserved bit is set in the PSC header\n");
+			ret = 1;
+			goto out;
+		}
+
+		/*
+		 * Sanity check that entry processing is not going backwards.
+	 * This will happen only if the hypervisor is tricking us.
+		 */
+		if (data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry) {
+			WARN(1, "SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
+			     end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry);
+			ret = 1;
+			goto out;
+		}
+	}
+
+out:
+	return ret;
+}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a744f7f2e72b..abdf431622ea 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,38 +655,6 @@ static u64 __init get_jump_table_addr(void)
 	return ret;
 }
 
-static void pvalidate_pages(struct snp_psc_desc *desc)
-{
-	struct psc_entry *e;
-	unsigned long vaddr;
-	unsigned int size;
-	unsigned int i;
-	bool validate;
-	int rc;
-
-	for (i = 0; i <= desc->hdr.end_entry; i++) {
-		e = &desc->entries[i];
-
-		vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
-		size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
-		validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
-
-		rc = pvalidate(vaddr, size, validate);
-		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
-			unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
-
-			for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
-				rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
-				if (rc)
-					break;
-			}
-		}
-
-		if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
-			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-	}
-}
-
 static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
 				  unsigned int npages, enum psc_op op)
 {
@@ -782,72 +750,6 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
 		WARN(1, "invalid memory op %d\n", op);
 }
 
-static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
-{
-	int cur_entry, end_entry, ret = 0;
-	struct snp_psc_desc *data;
-	struct es_em_ctxt ctxt;
-
-	vc_ghcb_invalidate(ghcb);
-
-	/* Copy the input desc into GHCB shared buffer */
-	data = (struct snp_psc_desc *)ghcb->shared_buffer;
-	memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
-
-	/*
-	 * As per the GHCB specification, the hypervisor can resume the guest
-	 * before processing all the entries. Check whether all the entries
-	 * are processed. If not, then keep retrying. Note, the hypervisor
-	 * will update the data memory directly to indicate the status, so
-	 * reference the data->hdr everywhere.
-	 *
-	 * The strategy here is to wait for the hypervisor to change the page
-	 * state in the RMP table before guest accesses the memory pages. If the
-	 * page state change was not successful, then later memory access will
-	 * result in a crash.
-	 */
-	cur_entry = data->hdr.cur_entry;
-	end_entry = data->hdr.end_entry;
-
-	while (data->hdr.cur_entry <= data->hdr.end_entry) {
-		ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
-
-		/* This will advance the shared buffer data points to. */
-		ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
-
-		/*
-		 * Page State Change VMGEXIT can pass error code through
-		 * exit_info_2.
-		 */
-		if (WARN(ret || ghcb->save.sw_exit_info_2,
-			 "SNP: PSC failed ret=%d exit_info_2=%llx\n",
-			 ret, ghcb->save.sw_exit_info_2)) {
-			ret = 1;
-			goto out;
-		}
-
-		/* Verify that reserved bit is not set */
-		if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
-			ret = 1;
-			goto out;
-		}
-
-		/*
-		 * Sanity check that entry processing is not going backwards.
-		 * This will happen only if hypervisor is tricking us.
-		 */
-		if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
-"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
-			 end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
-			ret = 1;
-			goto out;
-		}
-	}
-
-out:
-	return ret;
-}
-
 static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 				       unsigned long vaddr_end, int op)
 {
@@ -952,6 +854,20 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long vaddr;
+	unsigned int npages;
+
+	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+		return;
+
+	vaddr = (unsigned long)__va(start);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+}
+
 static int snp_set_vmsa(void *va, bool vmsa)
 {
 	u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
 #include <asm/setup.h>
 #include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
 
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
 			tdx_accept_memory(range_start * PMD_SIZE,
 					  range_end * PMD_SIZE);
+		} else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+			snp_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
 		} else {
 			panic("Cannot accept memory: unknown platform\n");
 		}
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCH v4 4/4] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-25 14:23   ` [PATCH v4 4/4] x86/sev: Add SNP-specific unaccepted memory support Tom Lendacky
@ 2022-08-25 22:10     ` Dionna Amalie Glaze
  2022-08-26 21:29       ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-08-25 22:10 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: LKML, the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Kirill A. Shutemov, H. Peter Anvin,
	Michael Roth, Joerg Roedel, Andy Lutomirski, Peter Zijlstra

>
> Add SNP-specific hooks to the unaccepted memory support in the boot
> path (__accept_memory()) and the core kernel (accept_memory()) in order
> to support booting SNP guests when unaccepted memory is present. Without
> this support, SNP guests will fail to boot and/or panic() when unaccepted
> memory is present in the EFI memory map.
>
> The process of accepting memory under SNP involves invoking the hypervisor
> to perform a page state change for the page to private memory and then
> issuing a PVALIDATE instruction to accept the page.

Thanks for this update! Tests show that boot shaves off a good few
seconds compared to eager acceptance, and it'll get better when we
have on-demand pinning.

The uncaught #VC exception is still there for 256GB machines and larger though.

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v4 4/4] x86/sev: Add SNP-specific unaccepted memory support
  2022-08-25 22:10     ` Dionna Amalie Glaze
@ 2022-08-26 21:29       ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-08-26 21:29 UTC (permalink / raw)
  To: Dionna Amalie Glaze
  Cc: LKML, the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Kirill A. Shutemov, H. Peter Anvin,
	Michael Roth, Joerg Roedel, Andy Lutomirski, Peter Zijlstra

On 8/25/22 17:10, Dionna Amalie Glaze wrote:
>>
>> Add SNP-specific hooks to the unaccepted memory support in the boot
>> path (__accept_memory()) and the core kernel (accept_memory()) in order
>> to support booting SNP guests when unaccepted memory is present. Without
>> this support, SNP guests will fail to boot and/or panic() when unaccepted
>> memory is present in the EFI memory map.
>>
>> The process of accepting memory under SNP involves invoking the hypervisor
>> to perform a page state change for the page to private memory and then
>> issuing a PVALIDATE instruction to accept the page.
> 
> Thanks for this update! Tests show the boot performance shaves off a
> good few seconds over eager acceptance, and it'll get better when we
> have on-demand pinning.
> 
> The uncaught #VC exception is still there for 256GB machines and larger though.

Any chance of getting a stack trace when this occurs, e.g. adding a 
WARN_ON() in vc_handle_exitcode() (assuming it happens when logging is 
enabled)?
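
Something along these lines is what I have in mind (illustrative and
untested; the exact placement in vc_handle_exitcode() may differ):

	default:
		/* Illustrative: get a backtrace for the unsupported exit-code. */
		WARN_ONCE(1, "Unsupported #VC exit-code: 0x%lx\n", exit_code);
		result = ES_UNSUPPORTED;
		break;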

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-15 22:02         ` Tom Lendacky
@ 2022-08-29 16:02           ` Dionna Amalie Glaze
  2022-08-29 16:19             ` Dave Hansen
  0 siblings, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-08-29 16:02 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Mel Gorman, Vlastimil Babka, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Dave Hansen, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Mike Rapoport

> > The stack trace is in mm/page_alloc.c. I've done a little
> > investigation, but I can't account for why there's a hard cutoff of
> > correctness at 256GB.
> >
> > [    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
> > [    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
> > 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
> > 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
> > ff ff
> > [    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
> > 0000000000000404
> > [    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
> > [    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
> > [    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
> > [    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
> > [    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
> > [    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
> > knlGS:0000000000000000
> > [    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
> > [    0.078209] Call Trace:
> > [    0.078524]  <TASK>
> > [    0.078887]  ? free_area_init+0x5c1/0x66c
> > [    0.079417]  ? zone_sizes_init+0x52/0x6c
> > [    0.079934]  ? setup_arch+0xa55/0xb6d
> > [    0.080417]  ? start_kernel+0x64/0x65a
> > [    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
> > [    0.081620]  </TASK>
>
> Note that there is a bug in Brijesh's version of the patch and it will
> almost exclusively use the MSR protocol. Please try the version of the
> patch that I recently sent up based on the current unaccepted memory tree
> from Kirill.
>

I've now tested this patch set together with Tom's new patch set, and it
appears that the problem at 256GB is more likely due to this unaccepted
memory patch set than to something AMD-specific.

Kirill, do you see any problems with 256GB on TDX? It seems there is
some unaccepted memory getting touched in memmap_init_range when the
VM's memory size is at least 256GB.

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-29 16:02           ` Dionna Amalie Glaze
@ 2022-08-29 16:19             ` Dave Hansen
  2022-09-06 17:50               ` Dionna Amalie Glaze
  0 siblings, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-08-29 16:19 UTC (permalink / raw)
  To: Dionna Amalie Glaze, Tom Lendacky
  Cc: Mel Gorman, Vlastimil Babka, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML,
	Mike Rapoport

On 8/29/22 09:02, Dionna Amalie Glaze wrote:
>>> The stack trace is in mm/page_alloc.c. I've done a little
>>> investigation, but I can't account for why there's a hard cutoff of
>>> correctness at 256GB.
>>>
>>> [    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
>>> [    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
>>> 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
>>> 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
>>> ff ff
>>> [    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
>>> 0000000000000404
>>> [    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
>>> [    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
>>> [    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
>>> [    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
>>> [    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
>>> [    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
>>> knlGS:0000000000000000
>>> [    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
>>> [    0.078209] Call Trace:
>>> [    0.078524]  <TASK>
>>> [    0.078887]  ? free_area_init+0x5c1/0x66c
>>> [    0.079417]  ? zone_sizes_init+0x52/0x6c
>>> [    0.079934]  ? setup_arch+0xa55/0xb6d
>>> [    0.080417]  ? start_kernel+0x64/0x65a
>>> [    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
>>> [    0.081620]  </TASK>
>> Note that there is a bug in Brijesh's version of the patch and it will
>> almost exclusively use the MSR protocol. Please try the version of the
>> patch that I recently sent up based on the current unaccepted memory tree
>> from Kirill.
>>
> I've now tested this patch set together with Tom's new patch set, and it
> appears that the problem at 256GB is more likely due to this unaccepted
> memory patch set than to something AMD-specific.
> 
> Kirill, do you see any problems with 256GB on TDX? It seems there is
> some unaccepted memory getting touched in memmap_init_range when the
> VM's memory size is at least 256GB.

It really helps this kind of stuff if you can post the *actual* error.
I assume this was a page fault, so there should have been some other
stuff before the RIP:...

Another thing that's really nice is to do the disassembly of the "Code:"
or share disassembly of memmap_init_range.  Even nicer would be to give
an faddr2line of the RIP value and track down which C code was actually
at fault.

It's *possible* to look into these things from what you posted, but it's
just slow.  I'm sure Kirill will appreciate the help.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-08-29 16:19             ` Dave Hansen
@ 2022-09-06 17:50               ` Dionna Amalie Glaze
  2022-09-08 12:11                 ` Mike Rapoport
  0 siblings, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-09-06 17:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Tom Lendacky, Mel Gorman, Vlastimil Babka, Kirill A. Shutemov,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML,
	Mike Rapoport

>
> It really helps this kind of stuff if you can post the *actual* error.
> I assume this was a page fault, so there should have been some other
> stuff before the RIP:...
>

I posted the error on August 15th. I was bumping the thread in my last
post since I confirmed with Tom Lendacky that AMD's patches weren't at
fault.
Here's a new dump below that matches the disassembly:

[    0.043137] Faking a node at [mem 0x0000000000000000-0x000000403fffffff]
[    0.044018] NODE_DATA(0) allocated [mem 0x403fffc000-0x403fffffff]
[    0.044922] Zone ranges:
[    0.045250]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.046039]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.046828]   Normal   [mem 0x0000000100000000-0x000000403fffffff]
[    0.047657] Movable zone start for each node
[    0.048201] Early memory node ranges
[    0.048674]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.049474]   node   0: [mem 0x0000000000100000-0x000000000080cfff]
[    0.050274]   node   0: [mem 0x000000000080f000-0x00000000beceefff]
[    0.051074]   node   0: [mem 0x00000000befff000-0x00000000bfbb0fff]
[    0.051874]   node   0: [mem 0x00000000bfbb2000-0x00000000bffdbfff]
[    0.052674]   node   0: [mem 0x0000000100000000-0x000000403fffffff]
[    0.053530] Initmem setup node 0 [mem 0x0000000000001000-0x000000403fffffff]
PANIC: Unsupported exit-code 0x404 in early #VC exception (IP:
0xfffffffface0cdd0)
[    0.056667] CPU: 0 PID: 0 Comm: swapper Not tainted
5.17.0-rc6-173762-gffb12b02c6d7-dirty #1
[    0.057744] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[    0.058920] RIP: 0010:memmap_init_range+0x11d/0x188
[    0.059686] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d a4 1e 65 ff 48
8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
ff ff
[    0.062121] RSP: 0000:ffffffffac603dc0 EFLAGS: 00010082 ORIG_RAX:
0000000000000404
[    0.063087] RAX: ffffda1ac0000048 RBX: 0000000000000001 RCX: 0000000000000000
[    0.063998] RDX: 0300000000000000 RSI: 0000000000000000 RDI: ffffda1ac0000040
[    0.064944] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
[    0.065873] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[    0.066782] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
[    0.067695] FS:  0000000000000000(0000) GS:ffffffffacd88000(0000)
knlGS:0000000000000000
[    0.068727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.069488] CR2: ffffda1ac0000074 CR3: 00080020b680c000 CR4: 00000000000606b0
[    0.070397] Call Trace:
[    0.070710]  <TASK>
[    0.070976]  ? free_area_init+0x724/0x7d4
[    0.071486]  ? zone_sizes_init+0x52/0x6c
[    0.071986]  ? setup_arch+0xa55/0xb77
[    0.072453]  ? start_kernel+0x64/0x65f
[    0.072931]  ? secondary_startup_64_no_verify+0xd6/0xdb
[    0.073598]  </TASK>

Note this is a crash in SEV-SNP, but I assume we'd get the same #VE in TDX.

> Another thing that's really nice is to do the disassembly of the "Code:"
> or share disassembly of memmap_init_range.

0000000000000172 <memmap_init_range>:
 172:   41 56                   push   %r14
 174:   89 f0                   mov    %esi,%eax
 176:   45 89 ce                mov    %r9d,%r14d
 179:   41 55                   push   %r13
 17b:   4c 8d 2c 39             lea    (%rcx,%rdi,1),%r13
 17f:   41 54                   push   %r12
 181:   49 89 d4                mov    %rdx,%r12
 184:   49 8d 55 ff             lea    -0x1(%r13),%rdx
 188:   48 3b 15 00 00 00 00    cmp    0x0(%rip),%rdx        # 18f
<memmap_init_range+0x1d>
 18f:   55                      push   %rbp
 190:   53                      push   %rbx
 191:   48 89 cb                mov    %rcx,%rbx
 194:   76 07                   jbe    19d <memmap_init_range+0x2b>
 196:   48 89 15 00 00 00 00    mov    %rdx,0x0(%rip)        # 19d
<memmap_init_range+0x2b>
 19d:   4c 89 e5                mov    %r12,%rbp
 1a0:   ba 03 00 00 00          mov    $0x3,%edx
 1a5:   48 c1 e0 3a             shl    $0x3a,%rax
 1a9:   48 c1 e5 38             shl    $0x38,%rbp
 1ad:   48 c1 e2 38             shl    $0x38,%rdx
 1b1:   48 21 d5                and    %rdx,%rbp
 1b4:   48 09 c5                or     %rax,%rbp
 1b7:   49 39 dd                cmp    %rbx,%r13
 1ba:   0f 86 31 01 00 00       jbe    2f1 <memmap_init_range+0x17f>
 1c0:   45 85 f6                test   %r14d,%r14d
 1c3:   0f 85 b4 00 00 00       jne    27d <memmap_init_range+0x10b>
 1c9:   49 83 fc 03             cmp    $0x3,%r12
 1cd:   0f 94 c1                sete   %cl
 1d0:   22 0d 00 00 00 00       and    0x0(%rip),%cl        # 1d6
<memmap_init_range+0x64>
 1d6:   0f 84 a1 00 00 00       je     27d <memmap_init_range+0x10b>
 1dc:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 1e3
<memmap_init_range+0x71>
 1e3:   48 85 d2                test   %rdx,%rdx
 1e6:   74 10                   je     1f8 <memmap_init_range+0x86>
 1e8:   48 8b 42 08             mov    0x8(%rdx),%rax
 1ec:   48 03 02                add    (%rdx),%rax
 1ef:   48 c1 e8 0c             shr    $0xc,%rax
 1f3:   48 39 d8                cmp    %rbx,%rax
 1f6:   77 55                   ja     24d <memmap_init_range+0xdb>
 1f8:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 1ff
<memmap_init_range+0x8d>
 1ff:   4c 6b 05 00 00 00 00    imul   $0x18,0x0(%rip),%r8        #
207 <memmap_init_range+0x95>
 206:   18
 207:   31 f6                   xor    %esi,%esi
 209:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 210
<memmap_init_range+0x9e>
 210:   49 01 c0                add    %rax,%r8
 213:   48 89 c7                mov    %rax,%rdi
 216:   4c 39 c0                cmp    %r8,%rax
 219:   73 26                   jae    241 <memmap_init_range+0xcf>
 21b:   48 8b 57 08             mov    0x8(%rdi),%rdx
 21f:   48 03 17                add    (%rdi),%rdx
 222:   48 83 c0 18             add    $0x18,%rax
 226:   48 c1 ea 0c             shr    $0xc,%rdx
 22a:   48 39 da                cmp    %rbx,%rdx
 22d:   76 0e                   jbe    23d <memmap_init_range+0xcb>
 22f:   40 84 f6                test   %sil,%sil
 232:   74 19                   je     24d <memmap_init_range+0xdb>
 234:   48 89 3d 00 00 00 00    mov    %rdi,0x0(%rip)        # 23b
<memmap_init_range+0xc9>
 23b:   eb 10                   jmp    24d <memmap_init_range+0xdb>
 23d:   89 ce                   mov    %ecx,%esi
 23f:   eb d2                   jmp    213 <memmap_init_range+0xa1>
 241:   40 84 f6                test   %sil,%sil
 244:   74 07                   je     24d <memmap_init_range+0xdb>
 246:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 24d
<memmap_init_range+0xdb>
 24d:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 254
<memmap_init_range+0xe2>
 254:   48 8b 02                mov    (%rdx),%rax
 257:   48 8d 88 ff 0f 00 00    lea    0xfff(%rax),%rcx
 25e:   48 c1 e9 0c             shr    $0xc,%rcx
 262:   48 39 d9                cmp    %rbx,%rcx
 265:   77 16                   ja     27d <memmap_init_range+0x10b>
 267:   f6 42 10 02             testb  $0x2,0x10(%rdx)
 26b:   74 10                   je     27d <memmap_init_range+0x10b>
 26d:   48 03 42 08             add    0x8(%rdx),%rax
 271:   48 c1 e8 0c             shr    $0xc,%rax
 275:   48 89 c3                mov    %rax,%rbx
 278:   e9 3a ff ff ff          jmp    1b7 <memmap_init_range+0x45>
 27d:   48 89 df                mov    %rbx,%rdi
 280:   48 c1 e7 06             shl    $0x6,%rdi
 284:   48 03 3d 00 00 00 00    add    0x0(%rip),%rdi        # 28b
<memmap_init_range+0x119>
 28b:   48 8d 47 08             lea    0x8(%rdi),%rax
 28f:   c7 47 34 01 00 00 00    movl   $0x1,0x34(%rdi)    # <-- crash RIP here
 296:   48 c7 47 38 00 00 00    movq   $0x0,0x38(%rdi)
 29d:   00
 29e:   c7 47 30 ff ff ff ff    movl   $0xffffffff,0x30(%rdi)
 2a5:   48 c7 47 28 00 00 00    movq   $0x0,0x28(%rdi)
 2ac:   00
 2ad:   48 c7 47 20 00 00 00    movq   $0x0,0x20(%rdi)
 2b4:   00
 2b5:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
 2bc:   00
 2bd:   48 89 2f                mov    %rbp,(%rdi)
 2c0:   48 89 47 08             mov    %rax,0x8(%rdi)
 2c4:   48 89 47 10             mov    %rax,0x10(%rdi)
 2c8:   41 83 fe 01             cmp    $0x1,%r14d
 2cc:   75 05                   jne    2d3 <memmap_init_range+0x161>
 2ce:   48 0f ba 2f 0c          btsq   $0xc,(%rdi)
 2d3:   f7 c3 ff 01 00 00       test   $0x1ff,%ebx
 2d9:   75 0e                   jne    2e9 <memmap_init_range+0x177>
 2db:   8b 74 24 38             mov    0x38(%rsp),%esi
 2df:   e8 00 00 00 00          call   2e4 <memmap_init_range+0x172>
 2e4:   e8 00 00 00 00          call   2e9 <memmap_init_range+0x177>
 2e9:   48 ff c3                inc    %rbx
 2ec:   e9 c6 fe ff ff          jmp    1b7 <memmap_init_range+0x45>
 2f1:   5b                      pop    %rbx
 2f2:   5d                      pop    %rbp
 2f3:   41 5c                   pop    %r12
 2f5:   41 5d                   pop    %r13
 2f7:   41 5e                   pop    %r14
 2f9:   c3                      ret

> Even nicer would be to give
> an faddr2line of the RIP value and track down which C code was actually
> at fault.

arch_atomic_set
arch/x86/include/asm/atomic.h:41

i.e. inlined via INIT_LIST_HEAD in __init_single_page, called from memmap_init_range.
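
For context, that lands in the very first stores __init_single_page()
makes to the struct page. Roughly, paraphrased from mm/page_alloc.c of
this era (from memory, not verbatim):

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
					 unsigned long zone, int nid)
{
	mm_zero_struct_page(page);
	set_page_links(page, zone, nid, pfn);
	init_page_count(page);		/* the movl $0x1,0x34(%rdi) at the RIP */
	page_mapcount_reset(page);

	INIT_LIST_HEAD(&page->lru);	/* the mov %rax,0x8(%rdi)/0x10(%rdi) pair */
	/* (remaining initialization omitted) */
}

Whichever store the compiler emits first, this is the first touch of the
struct page, so the vmemmap page backing it must already be accepted.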
--
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-06 17:50               ` Dionna Amalie Glaze
@ 2022-09-08 12:11                 ` Mike Rapoport
  2022-09-08 16:23                   ` Dionna Amalie Glaze
  0 siblings, 1 reply; 200+ messages in thread
From: Mike Rapoport @ 2022-09-08 12:11 UTC (permalink / raw)
  To: Dionna Amalie Glaze
  Cc: Dave Hansen, Tom Lendacky, Mel Gorman, Vlastimil Babka,
	Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML

On Tue, Sep 06, 2022 at 10:50:42AM -0700, Dionna Amalie Glaze wrote:
> >
> > It really helps this kind of stuff if you can post the *actual* error.
> > I assume this was a page fault, so there should have been some other
> > stuff before the RIP:...
> >
> 
> I posted the error on August 15th. I was bumping the thread in my last
> post since I confirmed with Tom Lendacky that AMD's patches weren't at
> fault.
> Here's a new dump below that matches the disassembly:
> 
> [    0.043137] Faking a node at [mem 0x0000000000000000-0x000000403fffffff]
> [    0.044018] NODE_DATA(0) allocated [mem 0x403fffc000-0x403fffffff]
> [    0.044922] Zone ranges:
> [    0.045250]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.046039]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.046828]   Normal   [mem 0x0000000100000000-0x000000403fffffff]
> [    0.047657] Movable zone start for each node
> [    0.048201] Early memory node ranges
> [    0.048674]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
> [    0.049474]   node   0: [mem 0x0000000000100000-0x000000000080cfff]
> [    0.050274]   node   0: [mem 0x000000000080f000-0x00000000beceefff]
> [    0.051074]   node   0: [mem 0x00000000befff000-0x00000000bfbb0fff]
> [    0.051874]   node   0: [mem 0x00000000bfbb2000-0x00000000bffdbfff]
> [    0.052674]   node   0: [mem 0x0000000100000000-0x000000403fffffff]
> [    0.053530] Initmem setup node 0 [mem 0x0000000000001000-0x000000403fffffff]
> PANIC: Unsupported exit-code 0x404 in early #VC exception (IP:
> 0xfffffffface0cdd0)
> [    0.056667] CPU: 0 PID: 0 Comm: swapper Not tainted
> 5.17.0-rc6-173762-gffb12b02c6d7-dirty #1
> [    0.057744] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS Google 01/01/2011
> [    0.058920] RIP: 0010:memmap_init_range+0x11d/0x188
> [    0.059686] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
> 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d a4 1e 65 ff 48
> 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
> ff ff
> [    0.062121] RSP: 0000:ffffffffac603dc0 EFLAGS: 00010082 ORIG_RAX:
> 0000000000000404
> [    0.063087] RAX: ffffda1ac0000048 RBX: 0000000000000001 RCX: 0000000000000000
> [    0.063998] RDX: 0300000000000000 RSI: 0000000000000000 RDI: ffffda1ac0000040
> [    0.064944] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
> [    0.065873] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [    0.066782] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
> [    0.067695] FS:  0000000000000000(0000) GS:ffffffffacd88000(0000)
> knlGS:0000000000000000
> [    0.068727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.069488] CR2: ffffda1ac0000074 CR3: 00080020b680c000 CR4: 00000000000606b0
> [    0.070397] Call Trace:
> [    0.070710]  <TASK>
> [    0.070976]  ? free_area_init+0x724/0x7d4
> [    0.071486]  ? zone_sizes_init+0x52/0x6c
> [    0.071986]  ? setup_arch+0xa55/0xb77
> [    0.072453]  ? start_kernel+0x64/0x65f
> [    0.072931]  ? secondary_startup_64_no_verify+0xd6/0xdb
> [    0.073598]  </TASK>
> 
> Note this is a crash in SEV-SNP, but I assume we'd get the same #VE in TDX.
> 
> > Another thing that's really nice is to do the disassembly of the "Code:"
> > or share disassembly of memmap_init_range.
> 
> 0000000000000172 <memmap_init_range>:
>  172:   41 56                   push   %r14
>  174:   89 f0                   mov    %esi,%eax
>  176:   45 89 ce                mov    %r9d,%r14d
>  179:   41 55                   push   %r13
>  17b:   4c 8d 2c 39             lea    (%rcx,%rdi,1),%r13
>  17f:   41 54                   push   %r12
>  181:   49 89 d4                mov    %rdx,%r12
>  184:   49 8d 55 ff             lea    -0x1(%r13),%rdx
>  188:   48 3b 15 00 00 00 00    cmp    0x0(%rip),%rdx        # 18f
> <memmap_init_range+0x1d>
>  18f:   55                      push   %rbp
>  190:   53                      push   %rbx
>  191:   48 89 cb                mov    %rcx,%rbx
>  194:   76 07                   jbe    19d <memmap_init_range+0x2b>
>  196:   48 89 15 00 00 00 00    mov    %rdx,0x0(%rip)        # 19d
> <memmap_init_range+0x2b>
>  19d:   4c 89 e5                mov    %r12,%rbp
>  1a0:   ba 03 00 00 00          mov    $0x3,%edx
>  1a5:   48 c1 e0 3a             shl    $0x3a,%rax
>  1a9:   48 c1 e5 38             shl    $0x38,%rbp
>  1ad:   48 c1 e2 38             shl    $0x38,%rdx
>  1b1:   48 21 d5                and    %rdx,%rbp
>  1b4:   48 09 c5                or     %rax,%rbp
>  1b7:   49 39 dd                cmp    %rbx,%r13
>  1ba:   0f 86 31 01 00 00       jbe    2f1 <memmap_init_range+0x17f>
>  1c0:   45 85 f6                test   %r14d,%r14d
>  1c3:   0f 85 b4 00 00 00       jne    27d <memmap_init_range+0x10b>
>  1c9:   49 83 fc 03             cmp    $0x3,%r12
>  1cd:   0f 94 c1                sete   %cl
>  1d0:   22 0d 00 00 00 00       and    0x0(%rip),%cl        # 1d6
> <memmap_init_range+0x64>
>  1d6:   0f 84 a1 00 00 00       je     27d <memmap_init_range+0x10b>
>  1dc:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 1e3
> <memmap_init_range+0x71>
>  1e3:   48 85 d2                test   %rdx,%rdx
>  1e6:   74 10                   je     1f8 <memmap_init_range+0x86>
>  1e8:   48 8b 42 08             mov    0x8(%rdx),%rax
>  1ec:   48 03 02                add    (%rdx),%rax
>  1ef:   48 c1 e8 0c             shr    $0xc,%rax
>  1f3:   48 39 d8                cmp    %rbx,%rax
>  1f6:   77 55                   ja     24d <memmap_init_range+0xdb>
>  1f8:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 1ff
> <memmap_init_range+0x8d>
>  1ff:   4c 6b 05 00 00 00 00    imul   $0x18,0x0(%rip),%r8        #
> 207 <memmap_init_range+0x95>
>  206:   18
>  207:   31 f6                   xor    %esi,%esi
>  209:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 210
> <memmap_init_range+0x9e>
>  210:   49 01 c0                add    %rax,%r8
>  213:   48 89 c7                mov    %rax,%rdi
>  216:   4c 39 c0                cmp    %r8,%rax
>  219:   73 26                   jae    241 <memmap_init_range+0xcf>
>  21b:   48 8b 57 08             mov    0x8(%rdi),%rdx
>  21f:   48 03 17                add    (%rdi),%rdx
>  222:   48 83 c0 18             add    $0x18,%rax
>  226:   48 c1 ea 0c             shr    $0xc,%rdx
>  22a:   48 39 da                cmp    %rbx,%rdx
>  22d:   76 0e                   jbe    23d <memmap_init_range+0xcb>
>  22f:   40 84 f6                test   %sil,%sil
>  232:   74 19                   je     24d <memmap_init_range+0xdb>
>  234:   48 89 3d 00 00 00 00    mov    %rdi,0x0(%rip)        # 23b
> <memmap_init_range+0xc9>
>  23b:   eb 10                   jmp    24d <memmap_init_range+0xdb>
>  23d:   89 ce                   mov    %ecx,%esi
>  23f:   eb d2                   jmp    213 <memmap_init_range+0xa1>
>  241:   40 84 f6                test   %sil,%sil
>  244:   74 07                   je     24d <memmap_init_range+0xdb>
>  246:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 24d
> <memmap_init_range+0xdb>
>  24d:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 254
> <memmap_init_range+0xe2>
>  254:   48 8b 02                mov    (%rdx),%rax
>  257:   48 8d 88 ff 0f 00 00    lea    0xfff(%rax),%rcx
>  25e:   48 c1 e9 0c             shr    $0xc,%rcx
>  262:   48 39 d9                cmp    %rbx,%rcx
>  265:   77 16                   ja     27d <memmap_init_range+0x10b>
>  267:   f6 42 10 02             testb  $0x2,0x10(%rdx)
>  26b:   74 10                   je     27d <memmap_init_range+0x10b>
>  26d:   48 03 42 08             add    0x8(%rdx),%rax
>  271:   48 c1 e8 0c             shr    $0xc,%rax
>  275:   48 89 c3                mov    %rax,%rbx
>  278:   e9 3a ff ff ff          jmp    1b7 <memmap_init_range+0x45>
>  27d:   48 89 df                mov    %rbx,%rdi
>  280:   48 c1 e7 06             shl    $0x6,%rdi
>  284:   48 03 3d 00 00 00 00    add    0x0(%rip),%rdi        # 28b
> <memmap_init_range+0x119>
>  28b:   48 8d 47 08             lea    0x8(%rdi),%rax
>  28f:   c7 47 34 01 00 00 00    movl   $0x1,0x34(%rdi)    # <-- crash RIP here
>  296:   48 c7 47 38 00 00 00    movq   $0x0,0x38(%rdi)
>  29d:   00
>  29e:   c7 47 30 ff ff ff ff    movl   $0xffffffff,0x30(%rdi)
>  2a5:   48 c7 47 28 00 00 00    movq   $0x0,0x28(%rdi)
>  2ac:   00
>  2ad:   48 c7 47 20 00 00 00    movq   $0x0,0x20(%rdi)
>  2b4:   00
>  2b5:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
>  2bc:   00
>  2bd:   48 89 2f                mov    %rbp,(%rdi)
>  2c0:   48 89 47 08             mov    %rax,0x8(%rdi)
>  2c4:   48 89 47 10             mov    %rax,0x10(%rdi)
>  2c8:   41 83 fe 01             cmp    $0x1,%r14d
>  2cc:   75 05                   jne    2d3 <memmap_init_range+0x161>
>  2ce:   48 0f ba 2f 0c          btsq   $0xc,(%rdi)
>  2d3:   f7 c3 ff 01 00 00       test   $0x1ff,%ebx
>  2d9:   75 0e                   jne    2e9 <memmap_init_range+0x177>
>  2db:   8b 74 24 38             mov    0x38(%rsp),%esi
>  2df:   e8 00 00 00 00          call   2e4 <memmap_init_range+0x172>
>  2e4:   e8 00 00 00 00          call   2e9 <memmap_init_range+0x177>
>  2e9:   48 ff c3                inc    %rbx
>  2ec:   e9 c6 fe ff ff          jmp    1b7 <memmap_init_range+0x45>
>  2f1:   5b                      pop    %rbx
>  2f2:   5d                      pop    %rbp
>  2f3:   41 5c                   pop    %r12
>  2f5:   41 5d                   pop    %r13
>  2f7:   41 5e                   pop    %r14
>  2f9:   c3                      ret
> 
> > Even nicer would be to give
> > an faddr2line of the RIP value and track down which C code was actually
> > at fault.
> 
> arch_atomic_set
> arch/x86/include/asm/atomic.h:41
> 
> i.e. inlined via INIT_LIST_HEAD in __init_single_page, called from memmap_init_range.

Looks like the first access to the memory map fails, although I think
it's not in INIT_LIST_HEAD() but rather in init_page_count().

I'd start with making sure that page_alloc::memmap_alloc() actually returns
accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
will be poisoned in this function, so my guess is it'd crash there.
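
For reference, memmap_alloc() is roughly the below (paraphrased from
memory, so double-check against your tree):

void *__init memmap_alloc(phys_addr_t size, phys_addr_t align,
			  phys_addr_t min_addr, int nid, bool exact_nid)
{
	void *ptr;

	if (exact_nid)
		ptr = memblock_alloc_exact_nid_raw(size, align, min_addr,
						   MEMBLOCK_ALLOC_ACCESSIBLE,
						   nid);
	else
		ptr = memblock_alloc_try_nid_raw(size, align, min_addr,
						 MEMBLOCK_ALLOC_ACCESSIBLE,
						 nid);

	/* With CONFIG_DEBUG_VM=y this memsets PAGE_POISON_PATTERN over the map */
	if (ptr && size > 0)
		page_init_poison(ptr, size);

	return ptr;
}

If the returned range was not accepted, the poisoning memset() is the
first write and should crash right here instead of later in
memmap_init_range().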

> --
> -Dionna Glaze, PhD (she/her)

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-08 12:11                 ` Mike Rapoport
@ 2022-09-08 16:23                   ` Dionna Amalie Glaze
  2022-09-08 19:28                     ` Mike Rapoport
  0 siblings, 1 reply; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-09-08 16:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Tom Lendacky, Mel Gorman, Vlastimil Babka,
	Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML

>
> Looks like the first access to the memory map fails, although I think
> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>
> I'd start with making sure that page_alloc::memmap_alloc() actually returns
> accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
> will be poisoned in this function, so my guess is it'd crash there.
>

That's a wonderful hint, thank you! I did not run this test with
CONFIG_DEBUG_VM set, but do you think it's possible it could still be
here?

> --
> Sincerely yours,
> Mike.



-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-08 16:23                   ` Dionna Amalie Glaze
@ 2022-09-08 19:28                     ` Mike Rapoport
  2022-09-22 14:31                       ` Tom Lendacky
  0 siblings, 1 reply; 200+ messages in thread
From: Mike Rapoport @ 2022-09-08 19:28 UTC (permalink / raw)
  To: Dionna Amalie Glaze
  Cc: Dave Hansen, Tom Lendacky, Mel Gorman, Vlastimil Babka,
	Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML

On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> >
> > Looks like the first access to the memory map fails, although I think
> > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> >
> > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
> > will be poisoned in this function, so my guess is it'd crash there.
> >
> 
> That's a wonderful hint, thank you! I did not run this test with
> CONFIG_DEBUG_VM set, but do you think it's possible it could still be
> here?

It depends on how you configured your kernel. Say, defconfig does not set
it.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v4 1/4] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-08-25 14:23   ` [PATCH v4 1/4] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
@ 2022-09-20 16:15     ` Borislav Petkov
  0 siblings, 0 replies; 200+ messages in thread
From: Borislav Petkov @ 2022-09-20 16:15 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On Thu, Aug 25, 2022 at 09:23:14AM -0500, Tom Lendacky wrote:
> In advance of providing support for unaccepted memory, switch from using
> kmalloc() for allocating the Page State Change (PSC) structure to using a
> local variable that lives on the stack. This is needed to avoid a possible
> recursive call into set_pages_state() if the kmalloc() call requires
> (more) memory to be accepted, which would result in a hang.
> 
> The current size of the PSC struct is 2,032 bytes. To make the struct more
> stack friendly, reduce the number of PSC entries from 253 down to 64,
> resulting in a size of 520 bytes. This is a nice compromise on struct size
> and total PSC requests while still allowing parallel PSC operations across
> vCPUs.
> 
> If the reduction in PSC entries results in any kind of performance issue
> (that is not seen at the moment), use of a larger static PSC struct, with
> fallback to the smaller stack version, can be investigated.
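
(Aside, sanity-checking those numbers: struct psc_hdr is 8 bytes and each
struct psc_entry is an 8-byte encoding, so 8 + 253 * 8 = 2,032 bytes and
8 + 64 * 8 = 520 bytes -- assuming no extra padding, the quoted sizes
line up.)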

"For more background info on this decision see the subthread in the Link
tag below."

> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>

Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com

> ---
>  arch/x86/include/asm/sev-common.h |  9 +++++++--
>  arch/x86/kernel/sev.c             | 10 ++--------
>  2 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index b8357d6ecd47..6c3d61c5f6a3 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -106,8 +106,13 @@ enum psc_op {
>  #define GHCB_HV_FT_SNP			BIT_ULL(0)
>  #define GHCB_HV_FT_SNP_AP_CREATION	BIT_ULL(1)
>  
> -/* SNP Page State Change NAE event */
> -#define VMGEXIT_PSC_MAX_ENTRY		253
> +/*
> + * SNP Page State Change NAE event
> + *   The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure,
> + *   which is a local variable (stack usage) in set_pages_state(). Do not

... which is a local stack variable...


> + *   increase this value without evaluating the impact to stack usage.
> + */
...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-08 19:28                     ` Mike Rapoport
@ 2022-09-22 14:31                       ` Tom Lendacky
  2022-09-24  1:03                         ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Tom Lendacky @ 2022-09-22 14:31 UTC (permalink / raw)
  To: Dionna Amalie Glaze, Kirill A. Shutemov
  Cc: Dave Hansen, Mel Gorman, Vlastimil Babka, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML,
	Mike Rapoport

On 9/8/22 14:28, Mike Rapoport wrote:
> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
>>>
>>> Looks like the first access to the memory map fails, although I think
>>> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>>>
>>> I'd start with making sure that page_alloc::memmap_alloc() actually returns
>>> accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
>>> will be poisoned in this function, so my guess is it'd crash there.
>>>
>>
>> That's a wonderful hint, thank you! I did not run this test with
>> CONFIG_DEBUG_VM set, but do you think it's possible it could still be
>> here?
> 
> It depends on how you configured your kernel. Say, defconfig does not set
> it.
> 

I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP 
and fails in memmap_init_range() when attempting to add the first PFN. It 
looks like the underlying page that is backing the vmemmap has not been 
accepted (I receive a #VC 0x404 => page not validated).

Kirill, is this a path that you've looked at? It would appear that 
somewhere in the vmemmap_populate_hugepages() path, some memory acceptance 
needs to be done for the pages that are used to back vmemmap. I'm not very 
familiar with this code, so I'm not sure why everything works for a guest 
with 255GB of memory, but then fails for a guest with 256GB of memory.

Thanks,
Tom

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-22 14:31                       ` Tom Lendacky
@ 2022-09-24  1:03                         ` Kirill A. Shutemov
  2022-09-24  9:36                           ` Mike Rapoport
  2022-09-26 12:10                           ` Kirill A. Shutemov
  0 siblings, 2 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-09-24  1:03 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Dionna Amalie Glaze, Dave Hansen, Mel Gorman, Vlastimil Babka,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML,
	Mike Rapoport

On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> On 9/8/22 14:28, Mike Rapoport wrote:
> > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > 
> > > > Looks like the first access to the memory map fails, although I think
> > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > 
> > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
> > > > will be poisoned in this function, so my guess is it'd crash there.
> > > > 
> > > 
> > > That's a wonderful hint, thank you! I did not run this test with
> > > CONFIG_DEBUG_VM set, but do you think it's possible it could still be
> > > here?
> > 
> > It depends on how you configured your kernel. Say, defconfig does not set
> > it.
> > 
> 
> I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> and fails in memmap_init_range() when attempting to add the first PFN. It
> looks like the underlying page that is backing the vmemmap has not been
> accepted (I receive a #VC 0x404 => page not validated).
> 
> Kirill, is this a path that you've looked at? It would appear that somewhere
> in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> done for the pages that are used to back vmemmap. I'm not very familiar with
> this code, so I'm not sure why everything works for a guest with 255GB of
> memory, but then fails for a guest with 256GB of memory.

Hm. I don't have a machine that large at hand at the moment. And I have not
looked at the codepath before.

I will try to look into the issue.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-24  1:03                         ` Kirill A. Shutemov
@ 2022-09-24  9:36                           ` Mike Rapoport
  2022-09-26 12:10                           ` Kirill A. Shutemov
  1 sibling, 0 replies; 200+ messages in thread
From: Mike Rapoport @ 2022-09-24  9:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Tom Lendacky, Dionna Amalie Glaze, Dave Hansen, Mel Gorman,
	Vlastimil Babka, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Dario Faggioli, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML,
	Mike Rapoport

On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> > On 9/8/22 14:28, Mike Rapoport wrote:
> > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > > 
> > > > > Looks like the first access to the memory map fails, although I think
> > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > > 
> > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > > accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
> > > > > will be poisoned in this function, so my guess is it'd crash there.
> > > > > 
> > > > 
> > > > That's a wonderful hint, thank you! I did not run this test with
> > > > CONFIG_DEBUG_VM set, but do you think it's possible it could still be
> > > > here?
> > > 
> > > It depends on how you configured your kernel. Say, defconfig does not set
> > > it.
> > > 
> > 
> > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> > and fails in memmap_init_range() when attempting to add the first PFN. It
> > looks like the underlying page that is backing the vmemmap has not been
> > accepted (I receive a #VC 0x404 => page not validated).
> > 
> > Kirill, is this a path that you've looked at? It would appear that somewhere
> > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> > done for the pages that are used to back vmemmap. I'm not very familiar with
> > this code, so I'm not sure why everything works for a guest with 255GB of
> > memory, but then fails for a guest with 256GB of memory.
> 
> Hm. I don't have a machine that large at hand at the moment. And I have not
> looked at the codepath before.
> 
> I will try to look into the issue.

I'd add some printks to verify we actually try to accept the allocated
memory. E.g. something like the patch below:

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..8f00639facc4 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -22,6 +22,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	if (!boot_params.unaccepted_memory)
 		return;
 
+	if (system_state == SYSTEM_BOOTING)
+		pr_info("%s: start: %pa end: %pa %pS\n", __func__, &start, &end, (void *)_RET_IP_);
+
 	bitmap = __va(boot_params.unaccepted_memory);
 	range_start = start / PMD_SIZE;
 
diff --git a/mm/memblock.c b/mm/memblock.c
index a1f7f8b304d5..029dd520102d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1535,7 +1535,7 @@ void * __init memblock_alloc_exact_nid_raw(
 			phys_addr_t min_addr, phys_addr_t max_addr,
 			int nid)
 {
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
+	pr_info("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
 		     __func__, (u64)size, (u64)align, nid, &min_addr,
 		     &max_addr, (void *)_RET_IP_);
 
@@ -1567,7 +1567,7 @@ void * __init memblock_alloc_try_nid_raw(
 			phys_addr_t min_addr, phys_addr_t max_addr,
 			int nid)
 {
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
+	pr_info("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
 		     __func__, (u64)size, (u64)align, nid, &min_addr,
 		     &max_addr, (void *)_RET_IP_);
 
 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-24  1:03                         ` Kirill A. Shutemov
  2022-09-24  9:36                           ` Mike Rapoport
@ 2022-09-26 12:10                           ` Kirill A. Shutemov
  2022-09-26 13:38                             ` Tom Lendacky
  1 sibling, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-09-26 12:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Tom Lendacky, Dionna Amalie Glaze, Dave Hansen, Mel Gorman,
	Vlastimil Babka, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Mike Rapoport

On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> > On 9/8/22 14:28, Mike Rapoport wrote:
> > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > > 
> > > > > Looks like the first access to the memory map fails, although I think
> > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > > 
> > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > > accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
> > > > > will be poisoned in this function, so my guess is it'd crash there.
> > > > > 
> > > > 
> > > > That's a wonderful hint, thank you! I did not run this test with
> > > > CONFIG_DEBUG_VM set, but do you think it's possible it could still be
> > > > here?
> > > 
> > > It depends on how you configured your kernel. Say, defconfig does not set
> > > it.
> > > 
> > 
> > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> > and fails in memmap_init_range() when attempting to add the first PFN. It
> > looks like the underlying page that is backing the vmemmap has not been
> > accepted (I receive a #VC 0x404 => page not validated).
> > 
> > Kirill, is this a path that you've looked at? It would appear that somewhere
> > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> > done for the pages that are used to back vmemmap. I'm not very familiar with
> > this code, so I'm not sure why everything works for a guest with 255GB of
> > memory, but then fails for a guest with 256GB of memory.
> 
> Hm. I don't have a machine that large at hand at the moment. And I have not
> looked at the codepath before.
> 
> I will try to look into the issue.

I'm not able to trigger the bug.

With the help of vm.overcommit_memory=1, I managed to boot a TDX guest to shell
with 256G and 1T of guest memory just fine.

Any chance it is SEV-SNP specific?

Or maybe there is some difference in kernel config? Could you share yours?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-26 12:10                           ` Kirill A. Shutemov
@ 2022-09-26 13:38                             ` Tom Lendacky
  2022-09-26 15:42                               ` Kirill A. Shutemov
  2022-09-26 15:42                               ` Tom Lendacky
  0 siblings, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-26 13:38 UTC (permalink / raw)
  To: Kirill A. Shutemov, Kirill A. Shutemov
  Cc: Dionna Amalie Glaze, Dave Hansen, Mel Gorman, Vlastimil Babka,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML,
	Mike Rapoport

On 9/26/22 07:10, Kirill A. Shutemov wrote:
> On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
>> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
>>> On 9/8/22 14:28, Mike Rapoport wrote:
>>>> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
>>>>>>
>>>>>> Looks like the first access to the memory map fails, although I think
>>>>>> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>>>>>>
>>>>>> I'd start with making sure that page_alloc::memmap_alloc() actually returns
>>>>>> accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
>>>>>> will be poisoned in this function, so my guess is it'd crash there.
>>>>>>
>>>>>
>>>>> That's a wonderful hint, thank you! I did not run this test with
>>>>> CONFIG_DEBUG_VM set, but do you think it's possible it could still be
>>>>> here?
>>>>
>>>> It depends on how you configured your kernel. Say, defconfig does not set
>>>> it.
>>>>
>>>
>>> I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
>>> and fails in memmap_init_range() when attempting to add the first PFN. It
>>> looks like the underlying page that is backing the vmemmap has not been
>>> accepted (I receive a #VC 0x404 => page not validated).
>>>
>>> Kirill, is this a path that you've looked at? It would appear that somewhere
>>> in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
>>> done for the pages that are used to back vmemmap. I'm not very familiar with
>>> this code, so I'm not sure why everything works for a guest with 255GB of
>>> memory, but then fails for a guest with 256GB of memory.
>>
>> Hm. I don't have a machine that large at hand at the moment. And I have not
>> looked at the codepath before.
>>
>> I will try to look into the issue.
> 
> I'm not able to trigger the bug.
> 
> With the help of vm.overcommit_memory=1, I managed to boot a TDX guest to shell
> with 256G and 1T of guest memory just fine.
> 
> Any chance it is SEV-SNP specific?

There's always a chance. I'll do some more tracing and see what I can find 
to try and be certain.

> 
> Or maybe there is some difference in kernel config? Could you share yours?

Yes, I'll send that to you off-list.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-26 13:38                             ` Tom Lendacky
@ 2022-09-26 15:42                               ` Kirill A. Shutemov
  2022-09-26 15:42                               ` Tom Lendacky
  1 sibling, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-09-26 15:42 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kirill A. Shutemov, Dionna Amalie Glaze, Dave Hansen, Mel Gorman,
	Vlastimil Babka, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Dario Faggioli, Mike Rapoport, David Hildenbrand, Marcelo Cerri,
	tim.gardner, Khalid ElMously, philip.cox,
	the arch/x86 maintainers, Linux Memory Management List,
	linux-coco, linux-efi, LKML, Mike Rapoport

On Mon, Sep 26, 2022 at 08:38:34AM -0500, Tom Lendacky wrote:
> On 9/26/22 07:10, Kirill A. Shutemov wrote:
> > On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> > > > On 9/8/22 14:28, Mike Rapoport wrote:
> > > > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > > > > 
> > > > > > > Looks like the first access to the memory map fails, although I think
> > > > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > > > > 
> > > > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > > > > accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
> > > > > > > will be poisoned in this function, so my guess is it'd crash there.
> > > > > > > 
> > > > > > 
> > > > > > That's a wonderful hint, thank you! I did not run this test with
> > > > > > CONFIG_DEBUG_VM set, but do you think it's possible it could still be
> > > > > > here?
> > > > > 
> > > > > It depends on how you configured your kernel. Say, defconfig does not set
> > > > > it.
> > > > > 
> > > > 
> > > > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> > > > and fails in memmap_init_range() when attempting to add the first PFN. It
> > > > looks like the underlying page that is backing the vmemmap has not been
> > > > accepted (I receive a #VC 0x404 => page not validated).
> > > > 
> > > > Kirill, is this a path that you've looked at? It would appear that somewhere
> > > > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> > > > done for the pages that are used to back vmemmap. I'm not very familiar with
> > > > this code, so I'm not sure why everything works for a guest with 255GB of
> > > > memory, but then fails for a guest with 256GB of memory.
> > > 
> > > Hm. I don't have a machine that large at hand at the moment. And I have not
> > > looked at the codepath before.
> > > 
> > > I will try to look into the issue.
> > 
> > I'm not able to trigger the bug.
> > 
> > With the help of vm.overcommit_memory=1, I managed to boot a TDX guest to shell
> > with 256G and 1T of guest memory just fine.
> > 
> > Any chance it is SEV-SNP specific?
> 
> There's always a chance. I'll do some more tracing and see what I can find
> to try and be certain.
> 
> > 
> > Or maybe there is some difference in kernel config? Could you share yours?
> 
> Yes, I'll send that to you off-list.

Still nothing with your config :/

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
  2022-09-26 13:38                             ` Tom Lendacky
  2022-09-26 15:42                               ` Kirill A. Shutemov
@ 2022-09-26 15:42                               ` Tom Lendacky
  1 sibling, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-26 15:42 UTC (permalink / raw)
  To: Kirill A. Shutemov, Kirill A. Shutemov
  Cc: Dionna Amalie Glaze, Dave Hansen, Mel Gorman, Vlastimil Babka,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Marcelo Cerri, tim.gardner,
	Khalid ElMously, philip.cox, the arch/x86 maintainers,
	Linux Memory Management List, linux-coco, linux-efi, LKML,
	Mike Rapoport

On 9/26/22 08:38, Tom Lendacky wrote:
> On 9/26/22 07:10, Kirill A. Shutemov wrote:
>> On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
>>>> On 9/8/22 14:28, Mike Rapoport wrote:
>>>>> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
>>>>>>>
>>>>>>> Looks like the first access to the memory map fails, although I think
>>>>>>> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>>>>>>>
>>>>>>> I'd start with making sure that page_alloc::memmap_alloc() actually returns
>>>>>>> accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y the memory map
>>>>>>> will be poisoned in this function, so my guess is it'd crash there.
>>>>>>>
>>>>>>
>>>>>> That's a wonderful hint, thank you! I did not run this test with
>>>>>> CONFIG_DEBUG_VM set, but do you think it's possible it could still be
>>>>>> here?
>>>>>
>>>>> It depends on how you configured your kernel. Say, defconfig does not set
>>>>> it.
>>>>>
>>>>
>>>> I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
>>>> and fails in memmap_init_range() when attempting to add the first PFN. It
>>>> looks like the underlying page that is backing the vmemmap has not been
>>>> accepted (I receive a #VC 0x404 => page not validated).
>>>>
>>>> Kirill, is this a path that you've looked at? It would appear that somewhere
>>>> in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
>>>> done for the pages that are used to back vmemmap. I'm not very familiar with
>>>> this code, so I'm not sure why everything works for a guest with 255GB of
>>>> memory, but then fails for a guest with 256GB of memory.
>>>
>>> Hm. I don't have a machine that large at hand at the moment. And I have not
>>> looked at the codepath before.
>>>
>>> I will try to look into the issue.
>>
>> I'm not able to trigger the bug.
>>
>> With the help of vm.overcommit_memory=1, I managed to boot a TDX guest to shell
>> with 256G and 1T of guest memory just fine.
>>
>> Any chance it is SEV-SNP specific?
> 
> There's always a chance. I'll do some more tracing and see what I can find 
> to try and be certain.

Indeed, it was an issue in the SNP path: shifting the number of pages, held
in an unsigned int, by PAGE_SHIFT and adding the result to an unsigned long.
The value being shifted was 0x100000, so the shift resulted in 0. Changing
the number of pages to an unsigned long fixed it.

There are a few places where that is being done, so I'll address those 
with some pre-patches as fixes.
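
For reference, the truncation is easy to reproduce in a small standalone C
program (a minimal sketch; the base address below is illustrative, not taken
from the actual crash):

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned int npages = 0x100000;			/* 4G worth of 4K pages */
	unsigned long vaddr = 0xffff888000000000UL;	/* illustrative base */

	/* npages is 32-bit, so the shift wraps: 0x100000 << 12 is 2^32,
	 * which truncates to 0, leaving the end address equal to vaddr. */
	unsigned long bad_end = vaddr + (npages << PAGE_SHIFT);

	/* Promoting npages to unsigned long first keeps all 64 bits. */
	unsigned long good_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);

	printf("bad:  %#lx\ngood: %#lx\n", bad_end, good_end);
	return 0;
}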

Thanks for verifying it was working for you.

Tom

> 
>>
>> Or maybe there is some difference in kernel config? Could you share yours?
> 
> Yes, I'll send that to you off-list.
> 
> Thanks,
> Tom
> 
>>

^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory
  2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2022-08-25 14:23 ` [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-09-27 17:04 ` Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages Tom Lendacky
                     ` (5 more replies)
  19 siblings, 6 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 17:04 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

This series adds SEV-SNP support for unaccepted memory to the patch series
titled:

  [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory

Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. Additionally, the page state change operations are not
optimized under Linux, since it was expected that all memory had already
been validated, resulting in poor performance when adding basic support
for unaccepted memory.
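
Schematically, the hang is a re-entrancy problem. The toy program below
models it (the names mirror the kernel functions, but the bodies are purely
illustrative and the recursion is forced unconditionally for demonstration):

#include <stdio.h>
#include <stdlib.h>

static int depth;

static void set_pages_state(void);

/* Stands in for kmalloc() handing back a page that still needs acceptance. */
static void *toy_kmalloc(size_t size)
{
	set_pages_state();	/* accepting the page re-enters the PSC path */
	return malloc(size);
}

static void set_pages_state(void)
{
	void *desc;

	if (++depth > 3) {	/* the real kernel simply hangs here */
		puts("re-entered before the first PSC completed -> hang");
		exit(0);
	}

	desc = toy_kmalloc(2032);	/* the kmalloc()'d PSC descriptor */
	/* ... issue the page state change using desc ... */
	free(desc);
	depth--;
}

int main(void)
{
	set_pages_state();
	return 0;
}

Putting the descriptor on the stack removes the allocation from the cycle
entirely, which is what this series does.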

This series consists of six patches:
  - Two pre-patch fixes which can be taken regardless of this series.

  - A pre-patch to switch from a kmalloc()'d page state change structure
    to a (smaller) stack-based page state change structure.

  - A pre-patch to allow the use of the early boot GHCB in the core kernel
    path.

  - A pre-patch to allow for use of 2M page state change requests and 2M
    page validation.

  - SNP support for unaccepted memory.

The series is based off of and tested against Kirill Shutemov's tree:
  https://github.com/intel/tdx.git guest-unaccepted-memory

---

Changes since v4:
- Two fixes for when an unsigned int is used as the number of pages to
  process: it needs to be converted to an unsigned long before being
  used to calculate ending addresses, otherwise a value >= 0x100000
  results in adding 0 in the calculation.
- Commit message and comment updates.

Changes since v3:
- Rework the PSC process to greatly improve performance:
  - Optimize the PSC process to use 2M pages when applicable.
  - Optimize the page validation process to use 2M pages when applicable.
  - Use the early GHCB in both the decompression phase and core kernel
    boot phase in order to minimize the use of the MSR protocol. The MSR
    protocol only allows for a single 4K page to be updated at a time.
- Move the ghcb_percpu_ready flag into the sev_config structure and
  rename it to ghcbs_initialized.

Changes since v2:
- Improve code comments regarding when to use the per-CPU GHCB vs
  the MSR protocol and why a single global value is valid for both
  the BSP and APs.
- Add a comment related to the number of PSC entries and how it can
  impact the size of the struct and, therefore, stack usage.
- Add a WARN_ON_ONCE() for invoking vmgexit_psc() when per-CPU GHCBs
  haven't been created or registered, yet.
- Use the compiler support for clearing the PSC struct instead of
  issuing memset().

Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
  structure.


Tom Lendacky (6):
  x86/sev: Fix calculation of end address based on number of pages
  x86/sev: Fix calculation of end address based on number of pages
  x86/sev: Put PSC struct on the stack in prep for unaccepted memory
    support
  x86/sev: Allow for use of the early boot GHCB for PSC requests
  x86/sev: Use large PSC requests if applicable
  x86/sev: Add SNP-specific unaccepted memory support

 arch/x86/Kconfig                  |   1 +
 arch/x86/boot/compressed/mem.c    |   3 +
 arch/x86/boot/compressed/sev.c    |  54 ++++++-
 arch/x86/boot/compressed/sev.h    |  23 +++
 arch/x86/include/asm/sev-common.h |   9 +-
 arch/x86/include/asm/sev.h        |   7 +
 arch/x86/kernel/sev-shared.c      | 104 +++++++++++++
 arch/x86/kernel/sev.c             | 250 +++++++++++++-----------------
 arch/x86/mm/unaccepted_memory.c   |   4 +
 9 files changed, 307 insertions(+), 148 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

-- 
2.37.3


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
@ 2022-09-27 17:04   ` Tom Lendacky
  2022-09-27 17:10     ` Dave Hansen
  2022-09-27 19:04     ` Dionna Amalie Glaze
  2022-09-27 17:04   ` [PATCH v5 2/6] " Tom Lendacky
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 17:04 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

When calculating an end address based on an unsigned int number of pages,
the number of pages must be cast to an unsigned long so that any value
greater than or equal to 0x100000 does not result in zero after the shift.

Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/kernel/sev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c05f0124c410..cac56540929d 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -649,7 +649,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
 	int rc;
 
 	vaddr = vaddr & PAGE_MASK;
-	vaddr_end = vaddr + (npages << PAGE_SHIFT);
+	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
 
 	while (vaddr < vaddr_end) {
 		rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
@@ -666,7 +666,7 @@ static void __init early_set_pages_state(unsigned long paddr, unsigned int npage
 	u64 val;
 
 	paddr = paddr & PAGE_MASK;
-	paddr_end = paddr + (npages << PAGE_SHIFT);
+	paddr_end = paddr + ((unsigned long)npages << PAGE_SHIFT);
 
 	while (paddr < paddr_end) {
 		/*
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v5 2/6] x86/sev: Fix calculation of end address based on number of pages
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages Tom Lendacky
@ 2022-09-27 17:04   ` Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 3/6] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 17:04 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

When calculating an end address based on an unsigned int number of pages,
the number of pages must be cast to an unsigned long so that any value
greater than or equal to 0x100000 does not result in zero after the shift.

Fixes: dc3f3d2474b8 ("x86/mm: Validate memory when changing the C-bit")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/kernel/sev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index cac56540929d..c90a47c39f6b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -875,7 +875,7 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 		panic("SNP: failed to allocate memory for PSC descriptor\n");
 
 	vaddr = vaddr & PAGE_MASK;
-	vaddr_end = vaddr + (npages << PAGE_SHIFT);
+	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
 
 	while (vaddr < vaddr_end) {
 		/* Calculate the last vaddr that fits in one struct snp_psc_desc. */
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v5 3/6] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 2/6] " Tom Lendacky
@ 2022-09-27 17:04   ` Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 4/6] x86/sev: Allow for use of the early boot GHCB for PSC requests Tom Lendacky
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 17:04 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.

The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.

If the reduction in PSC entries results in any kind of performance issue
(that is not seen at the moment), use of a larger static PSC struct, with
fallback to the smaller stack version, can be investigated.

For more background info on this decision, see the subthread in the Link:
tag below.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com
---
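As a sanity check on the sizes quoted in the commit message, the layout is
an 8-byte header plus 8 bytes per entry. A standalone sketch follows; the
entry is modeled as a packed 64-bit value rather than the exact bitfields,
and the header field widths are chosen so the totals match (the real
definitions live in arch/x86/include/asm/sev-common.h):

#include <stdio.h>
#include <stdint.h>

struct toy_psc_hdr {		/* 8 bytes */
	uint16_t cur_entry;
	uint16_t end_entry;
	uint32_t reserved;
};

struct toy_psc_entry {		/* 8 bytes: gfn/operation/pagesize packed */
	uint64_t raw;
};

int main(void)
{
	size_t hdr = sizeof(struct toy_psc_hdr);
	size_t e = sizeof(struct toy_psc_entry);

	printf("253 entries: %zu bytes\n", hdr + 253 * e);	/* 2032 */
	printf(" 64 entries: %zu bytes\n", hdr + 64 * e);	/* 520 */
	return 0;
}
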
 arch/x86/include/asm/sev-common.h |  9 +++++++--
 arch/x86/kernel/sev.c             | 10 ++--------
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..8ddfdbe521d4 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -106,8 +106,13 @@ enum psc_op {
 #define GHCB_HV_FT_SNP			BIT_ULL(0)
 #define GHCB_HV_FT_SNP_AP_CREATION	BIT_ULL(1)
 
-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY		253
+/*
+ * SNP Page State Change NAE event
+ *   The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure, which
+ *   is a local stack variable in set_pages_state(). Do not increase this value
+ *   without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY		64
 
 struct psc_hdr {
 	u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c90a47c39f6b..664a4de91757 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -868,11 +868,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
 	unsigned long vaddr_end, next_vaddr;
-	struct snp_psc_desc *desc;
-
-	desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
-	if (!desc)
-		panic("SNP: failed to allocate memory for PSC descriptor\n");
+	struct snp_psc_desc desc;
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
@@ -882,12 +878,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 		next_vaddr = min_t(unsigned long, vaddr_end,
 				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
 
-		__set_pages_state(desc, vaddr, next_vaddr, op);
+		__set_pages_state(&desc, vaddr, next_vaddr, op);
 
 		vaddr = next_vaddr;
 	}
-
-	kfree(desc);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v5 4/6] x86/sev: Allow for use of the early boot GHCB for PSC requests
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
                     ` (2 preceding siblings ...)
  2022-09-27 17:04   ` [PATCH v5 3/6] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
@ 2022-09-27 17:04   ` Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 5/6] x86/sev: Use large PSC requests if applicable Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 6/6] x86/sev: Add SNP-specific unaccepted memory support Tom Lendacky
  5 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 17:04 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Using a GHCB for a page state change (as opposed to the MSR protocol)
allows for multiple pages to be processed in a single request. In prep
for early PSC requests in support of unaccepted memory, update the
invocation of vmgexit_psc() to be able to use the early boot GHCB and not
just the per-CPU GHCB structure.

In order to use the proper GHCB (early boot vs per-CPU), set a flag that
indicates when the per-CPU GHCBs are available and registered. For APs,
the per-CPU GHCBs are created before they are started and registered upon
startup, so this flag can be used globally for the BSP and APs instead of
creating a per-CPU flag. This will allow for a significant reduction in
the number of MSR protocol page state change requests when accepting
memory.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
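As a rough worked example of the savings (assuming 4K pages and the
64-entry PSC structure from the previous patch): accepting a 2M region via
the MSR protocol takes 512 VMGEXITs, one per 4K page, while a single GHCB
PSC request carries up to 64 entries, so the same region needs roughly 8
VMGEXITs, and just one once 2M entries are supported (next patch).
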
 arch/x86/kernel/sev.c | 61 +++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 664a4de91757..0b958d77abb4 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -117,7 +117,19 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
 
 struct sev_config {
 	__u64 debug		: 1,
-	      __reserved	: 63;
+
+	      /*
+	       * A flag used by __set_pages_state() that indicates when the
+	       * per-CPU GHCB has been created and registered and thus can be
+	       * used by the BSP instead of the early boot GHCB.
+	       *
+	       * For APs, the per-CPU GHCB is created before they are started
+	       * and registered upon startup, so this flag can be used globally
+	       * for the BSP and APs.
+	       */
+	      ghcbs_initialized	: 1,
+
+	      __reserved	: 62;
 };
 
 static struct sev_config sev_cfg __read_mostly;
@@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
 	}
 }
 
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
@@ -742,26 +754,13 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
 		WARN(1, "invalid memory op %d\n", op);
 }
 
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
 {
 	int cur_entry, end_entry, ret = 0;
 	struct snp_psc_desc *data;
-	struct ghcb_state state;
 	struct es_em_ctxt ctxt;
-	unsigned long flags;
-	struct ghcb *ghcb;
 
-	/*
-	 * __sev_get_ghcb() needs to run with IRQs disabled because it is using
-	 * a per-CPU GHCB.
-	 */
-	local_irq_save(flags);
-
-	ghcb = __sev_get_ghcb(&state);
-	if (!ghcb) {
-		ret = 1;
-		goto out_unlock;
-	}
+	vc_ghcb_invalidate(ghcb);
 
 	/* Copy the input desc into GHCB shared buffer */
 	data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -818,20 +817,18 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
 	}
 
 out:
-	__sev_put_ghcb(&state);
-
-out_unlock:
-	local_irq_restore(flags);
-
 	return ret;
 }
 
 static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 			      unsigned long vaddr_end, int op)
 {
+	struct ghcb_state state;
 	struct psc_hdr *hdr;
 	struct psc_entry *e;
+	unsigned long flags;
 	unsigned long pfn;
+	struct ghcb *ghcb;
 	int i;
 
 	hdr = &data->hdr;
@@ -861,8 +858,20 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 		i++;
 	}
 
-	if (vmgexit_psc(data))
+	local_irq_save(flags);
+
+	if (sev_cfg.ghcbs_initialized)
+		ghcb = __sev_get_ghcb(&state);
+	else
+		ghcb = boot_ghcb;
+
+	if (!ghcb || vmgexit_psc(ghcb, data))
 		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+	if (sev_cfg.ghcbs_initialized)
+		__sev_put_ghcb(&state);
+
+	local_irq_restore(flags);
 }
 
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
@@ -870,6 +879,10 @@ static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 	unsigned long vaddr_end, next_vaddr;
 	struct snp_psc_desc desc;
 
+	/* Use the MSR protocol when a GHCB is not available. */
+	if (!boot_ghcb)
+		return early_set_pages_state(__pa(vaddr), npages, op);
+
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
 
@@ -1248,6 +1261,8 @@ void setup_ghcb(void)
 		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 			snp_register_per_cpu_ghcb();
 
+		sev_cfg.ghcbs_initialized = true;
+
 		return;
 	}
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v5 5/6] x86/sev: Use large PSC requests if applicable
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
                     ` (3 preceding siblings ...)
  2022-09-27 17:04   ` [PATCH v5 4/6] x86/sev: Allow for use of the early boot GHCB for PSC requests Tom Lendacky
@ 2022-09-27 17:04   ` Tom Lendacky
  2022-09-27 17:04   ` [PATCH v5 6/6] x86/sev: Add SNP-specific unaccepted memory support Tom Lendacky
  5 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 17:04 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

In advance of providing support for unaccepted memory, request 2M Page
State Change (PSC) requests when the address range allows for it. By using
a 2M page size, more PSC operations can be handled in a single request to
the hypervisor. The hypervisor will determine if it can accommodate the
larger request by checking the mapping in the nested page table. If mapped
as a large page, then the 2M page request can be performed, otherwise the
2M page request will be broken down into 512 4K page requests. This is
still more efficient than having the guest perform multiple PSC requests
in order to process the 512 4K pages.

In conjunction with the 2M PSC requests, attempt to perform the associated
PVALIDATE instruction for the page using the 2M page size. If PVALIDATE
fails with a size mismatch, then fall back to validating 512 4K pages. To
do this, page validation is modified to work with the PSC structure and
not just a virtual address range.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
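To illustrate the sizing policy, here is a standalone sketch of the
entry-granularity logic (not the kernel code itself; 4K base pages and a
2M PMD size are assumed):

#include <stdio.h>

#define PAGE_SIZE	(1UL << 12)
#define PMD_PAGE_SIZE	(1UL << 21)
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/* Count the PSC entries a physically contiguous range needs when 2M
 * entries are used wherever alignment and length allow. */
static unsigned long psc_entries(unsigned long va, unsigned long end)
{
	unsigned long n = 0;

	while (va < end) {
		if (IS_ALIGNED(va, PMD_PAGE_SIZE) && end - va >= PMD_PAGE_SIZE)
			va += PMD_PAGE_SIZE;	/* one 2M entry */
		else
			va += PAGE_SIZE;	/* one 4K entry */
		n++;
	}
	return n;
}

int main(void)
{
	/* A ~4M range starting one 4K page below a 2M boundary needs one
	 * head 4K entry plus two 2M entries: 3 entries instead of 1025. */
	printf("%lu entries\n", psc_entries(0x1ff000UL, 0x600000UL));
	return 0;
}
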
 arch/x86/include/asm/sev.h |   4 ++
 arch/x86/kernel/sev.c      | 125 ++++++++++++++++++++++++-------------
 2 files changed, 84 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 19514524f0f8..0007ab04ac5f 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -79,11 +79,15 @@ extern void vc_no_ghcb(void);
 extern void vc_boot_ghcb(void);
 extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 
+/* PVALIDATE return codes */
+#define PVALIDATE_FAIL_SIZEMISMATCH	6
+
 /* Software defined (when rFlags.CF = 1) */
 #define PVALIDATE_FAIL_NOUPDATE		255
 
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
+#define RMP_PG_SIZE_2M			1
 
 #define RMPADJUST_VMSA_PAGE_BIT		BIT(16)
 
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 0b958d77abb4..eabb8dd5be5b 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,32 +655,58 @@ static u64 __init get_jump_table_addr(void)
 	return ret;
 }
 
-static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool validate)
+static void pvalidate_pages(struct snp_psc_desc *desc)
 {
-	unsigned long vaddr_end;
+	struct psc_entry *e;
+	unsigned long vaddr;
+	unsigned int size;
+	unsigned int i;
+	bool validate;
 	int rc;
 
-	vaddr = vaddr & PAGE_MASK;
-	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
+	for (i = 0; i <= desc->hdr.end_entry; i++) {
+		e = &desc->entries[i];
+
+		vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+		size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+		validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+		rc = pvalidate(vaddr, size, validate);
+		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+			unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+			for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+				rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+				if (rc)
+					break;
+			}
+		}
 
-	while (vaddr < vaddr_end) {
-		rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
 		if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
 			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-
-		vaddr = vaddr + PAGE_SIZE;
 	}
 }
 
-static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+				  unsigned int npages, enum psc_op op)
 {
 	unsigned long paddr_end;
 	u64 val;
+	int ret;
+
+	vaddr = vaddr & PAGE_MASK;
 
 	paddr = paddr & PAGE_MASK;
 	paddr_end = paddr + ((unsigned long)npages << PAGE_SHIFT);
 
 	while (paddr < paddr_end) {
+		if (op == SNP_PAGE_STATE_SHARED) {
+			/* Page validation must be rescinded before changing to shared */
+			ret = pvalidate(vaddr, RMP_PG_SIZE_4K, false);
+			if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+				goto e_term;
+		}
+
 		/*
 		 * Use the MSR protocol because this function can be called before
 		 * the GHCB is established.
@@ -701,7 +727,15 @@ static void early_set_pages_state(unsigned long paddr, unsigned int npages, enum
 			 paddr, GHCB_MSR_PSC_RESP_VAL(val)))
 			goto e_term;
 
-		paddr = paddr + PAGE_SIZE;
+		if (op == SNP_PAGE_STATE_PRIVATE) {
+			/* Page validation must be performed after changing to private */
+			ret = pvalidate(vaddr, RMP_PG_SIZE_4K, true);
+			if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+				goto e_term;
+		}
+
+		vaddr += PAGE_SIZE;
+		paddr += PAGE_SIZE;
 	}
 
 	return;
@@ -720,10 +754,7 @@ void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long padd
 	  * Ask the hypervisor to mark the memory pages as private in the RMP
 	  * table.
 	  */
-	early_set_pages_state(paddr, npages, SNP_PAGE_STATE_PRIVATE);
-
-	/* Validate the memory pages after they've been added in the RMP table. */
-	pvalidate_pages(vaddr, npages, true);
+	early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_PRIVATE);
 }
 
 void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
@@ -732,11 +763,8 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
 	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 		return;
 
-	/* Invalidate the memory pages before they are marked shared in the RMP table. */
-	pvalidate_pages(vaddr, npages, false);
-
 	 /* Ask hypervisor to mark the memory pages shared in the RMP table. */
-	early_set_pages_state(paddr, npages, SNP_PAGE_STATE_SHARED);
+	early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
 }
 
 void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op)
@@ -820,10 +848,11 @@ static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
 	return ret;
 }
 
-static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
-			      unsigned long vaddr_end, int op)
+static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
+				       unsigned long vaddr_end, int op)
 {
 	struct ghcb_state state;
+	bool use_large_entry;
 	struct psc_hdr *hdr;
 	struct psc_entry *e;
 	unsigned long flags;
@@ -837,27 +866,37 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 	memset(data, 0, sizeof(*data));
 	i = 0;
 
-	while (vaddr < vaddr_end) {
-		if (is_vmalloc_addr((void *)vaddr))
+	while (vaddr < vaddr_end && i < ARRAY_SIZE(data->entries)) {
+		hdr->end_entry = i;
+
+		if (is_vmalloc_addr((void *)vaddr)) {
 			pfn = vmalloc_to_pfn((void *)vaddr);
-		else
+			use_large_entry = false;
+		} else {
 			pfn = __pa(vaddr) >> PAGE_SHIFT;
+			use_large_entry = true;
+		}
 
 		e->gfn = pfn;
 		e->operation = op;
-		hdr->end_entry = i;
 
-		/*
-		 * Current SNP implementation doesn't keep track of the RMP page
-		 * size so use 4K for simplicity.
-		 */
-		e->pagesize = RMP_PG_SIZE_4K;
+		if (use_large_entry && IS_ALIGNED(vaddr, PMD_PAGE_SIZE) &&
+		    (vaddr_end - vaddr) >= PMD_PAGE_SIZE) {
+			e->pagesize = RMP_PG_SIZE_2M;
+			vaddr += PMD_PAGE_SIZE;
+		} else {
+			e->pagesize = RMP_PG_SIZE_4K;
+			vaddr += PAGE_SIZE;
+		}
 
-		vaddr = vaddr + PAGE_SIZE;
 		e++;
 		i++;
 	}
 
+	/* Page validation must be rescinded before changing to shared */
+	if (op == SNP_PAGE_STATE_SHARED)
+		pvalidate_pages(data);
+
 	local_irq_save(flags);
 
 	if (sev_cfg.ghcbs_initialized)
@@ -865,6 +904,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 	else
 		ghcb = boot_ghcb;
 
+	/* Invoke the hypervisor to perform the page state changes */
 	if (!ghcb || vmgexit_psc(ghcb, data))
 		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
 
@@ -872,29 +912,28 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 		__sev_put_ghcb(&state);
 
 	local_irq_restore(flags);
+
+	/* Page validation must be performed after changing to private */
+	if (op == SNP_PAGE_STATE_PRIVATE)
+		pvalidate_pages(data);
+
+	return vaddr;
 }
 
 static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
 {
-	unsigned long vaddr_end, next_vaddr;
 	struct snp_psc_desc desc;
+	unsigned long vaddr_end;
 
 	/* Use the MSR protocol when a GHCB is not available. */
 	if (!boot_ghcb)
-		return early_set_pages_state(__pa(vaddr), npages, op);
+		return early_set_pages_state(vaddr, __pa(vaddr), npages, op);
 
 	vaddr = vaddr & PAGE_MASK;
 	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
 
-	while (vaddr < vaddr_end) {
-		/* Calculate the last vaddr that fits in one struct snp_psc_desc. */
-		next_vaddr = min_t(unsigned long, vaddr_end,
-				   (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
-
-		__set_pages_state(&desc, vaddr, next_vaddr, op);
-
-		vaddr = next_vaddr;
-	}
+	while (vaddr < vaddr_end)
+		vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
@@ -902,8 +941,6 @@ void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
 	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
 		return;
 
-	pvalidate_pages(vaddr, npages, false);
-
 	set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
 }
 
@@ -913,8 +950,6 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 		return;
 
 	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
-
-	pvalidate_pages(vaddr, npages, true);
 }
 
 static int snp_set_vmsa(void *va, bool vmsa)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 200+ messages in thread

* [PATCH v5 6/6] x86/sev: Add SNP-specific unaccepted memory support
  2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
                     ` (4 preceding siblings ...)
  2022-09-27 17:04   ` [PATCH v5 5/6] x86/sev: Use large PSC requests if applicable Tom Lendacky
@ 2022-09-27 17:04   ` Tom Lendacky
  5 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 17:04 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change to make the page private and then issuing
a PVALIDATE instruction to accept the page.

Since the boot path and the core kernel paths perform similar operations,
move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c
to avoid code duplication.

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/Kconfig                |   1 +
 arch/x86/boot/compressed/mem.c  |   3 +
 arch/x86/boot/compressed/sev.c  |  54 ++++++++++++++-
 arch/x86/boot/compressed/sev.h  |  23 +++++++
 arch/x86/include/asm/sev.h      |   3 +
 arch/x86/kernel/sev-shared.c    | 104 +++++++++++++++++++++++++++++
 arch/x86/kernel/sev.c           | 112 ++++----------------------------
 arch/x86/mm/unaccepted_memory.c |   4 ++
 8 files changed, 205 insertions(+), 99 deletions(-)
 create mode 100644 arch/x86/boot/compressed/sev.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34146ecc5bdd..0ad53c3533c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1553,6 +1553,7 @@ config AMD_MEM_ENCRYPT
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
+	select UNACCEPTED_MEMORY
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 48e36e640da1..3e19dc0da0d7 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
 #include "find.h"
 #include "math.h"
 #include "tdx.h"
+#include "sev.h"
 #include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
@@ -39,6 +40,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 	/* Platform-specific memory-acceptance call goes here */
 	if (is_tdx_guest())
 		tdx_accept_memory(start, end);
+	else if (sev_snp_enabled())
+		snp_accept_memory(start, end);
 	else
 		error("Cannot accept memory: unknown platform\n");
 }
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 730c4677e9db..22da65c96b47 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 /* Include code for early handlers */
 #include "../../kernel/sev-shared.c"
 
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
 {
 	return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
 }
@@ -181,6 +181,58 @@ static bool early_setup_ghcb(void)
 	return true;
 }
 
+static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
+				       phys_addr_t pa, phys_addr_t pa_end)
+{
+	struct psc_hdr *hdr;
+	struct psc_entry *e;
+	unsigned int i;
+
+	hdr = &desc->hdr;
+	memset(hdr, 0, sizeof(*hdr));
+
+	e = desc->entries;
+
+	i = 0;
+	while (pa < pa_end && i < VMGEXIT_PSC_MAX_ENTRY) {
+		hdr->end_entry = i;
+
+		e->gfn = pa >> PAGE_SHIFT;
+		e->operation = SNP_PAGE_STATE_PRIVATE;
+		if (IS_ALIGNED(pa, PMD_PAGE_SIZE) && (pa_end - pa) >= PMD_PAGE_SIZE) {
+			e->pagesize = RMP_PG_SIZE_2M;
+			pa += PMD_PAGE_SIZE;
+		} else {
+			e->pagesize = RMP_PG_SIZE_4K;
+			pa += PAGE_SIZE;
+		}
+
+		e++;
+		i++;
+	}
+
+	if (vmgexit_psc(boot_ghcb, desc))
+		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+	pvalidate_pages(desc);
+
+	return pa;
+}
+
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct snp_psc_desc desc = {};
+	unsigned int i;
+	phys_addr_t pa;
+
+	if (!boot_ghcb && !early_setup_ghcb())
+		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+	pa = start;
+	while (pa < end)
+		pa = __snp_accept_memory(&desc, pa, end);
+}
+
 void sev_es_shutdown_ghcb(void)
 {
 	if (!boot_ghcb)
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0007ab04ac5f..9297aab0c79e 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -206,6 +206,7 @@ void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -230,6 +231,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
 	return -ENOTTY;
 }
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
 #endif
 
 #endif
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index b478edf43bec..7ac7857da2b8 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -12,6 +12,9 @@
 #ifndef __BOOT_COMPRESSED
 #define error(v)	pr_err(v)
 #define has_cpuflag(f)	boot_cpu_has(f)
+#else
+#undef WARN
+#define WARN(condition...)
 #endif
 
 /* I/O parameters for CPUID-related helpers */
@@ -998,3 +1001,104 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
 			cpuid_ext_range_max = fn->eax;
 	}
 }
+
+static void pvalidate_pages(struct snp_psc_desc *desc)
+{
+	struct psc_entry *e;
+	unsigned long vaddr;
+	unsigned int size;
+	unsigned int i;
+	bool validate;
+	int rc;
+
+	for (i = 0; i <= desc->hdr.end_entry; i++) {
+		e = &desc->entries[i];
+
+		vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+		size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+		validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+		rc = pvalidate(vaddr, size, validate);
+		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+			unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
+
+			for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+				rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+				if (rc)
+					break;
+			}
+		}
+
+		if (rc) {
+			WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
+		}
+	}
+}
+
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
+{
+	int cur_entry, end_entry, ret = 0;
+	struct snp_psc_desc *data;
+	struct es_em_ctxt ctxt;
+
+	vc_ghcb_invalidate(ghcb);
+
+	/* Copy the input desc into GHCB shared buffer */
+	data = (struct snp_psc_desc *)ghcb->shared_buffer;
+	memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
+
+	/*
+	 * As per the GHCB specification, the hypervisor can resume the guest
+	 * before processing all the entries. Check whether all the entries
+	 * are processed. If not, then keep retrying. Note, the hypervisor
+	 * will update the data memory directly to indicate the status, so
+	 * reference the data->hdr everywhere.
+	 *
+	 * The strategy here is to wait for the hypervisor to change the page
+	 * state in the RMP table before guest accesses the memory pages. If the
+	 * page state change was not successful, then later memory access will
+	 * result in a crash.
+	 */
+	cur_entry = data->hdr.cur_entry;
+	end_entry = data->hdr.end_entry;
+
+	while (data->hdr.cur_entry <= data->hdr.end_entry) {
+		ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
+
+		/* This will advance the shared buffer data points to. */
+		ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
+
+		/*
+		 * Page State Change VMGEXIT can pass error code through
+		 * exit_info_2.
+		 */
+		if (ret || ghcb->save.sw_exit_info_2) {
+			WARN(1, "SNP: PSC failed ret=%d exit_info_2=%llx\n",
+			     ret, ghcb->save.sw_exit_info_2);
+			ret = 1;
+			goto out;
+		}
+
+		/* Verify that reserved bit is not set */
+		if (data->hdr.reserved) {
+			WARN(1, "Reserved bit is set in the PSC header\n");
+			ret = 1;
+			goto out;
+		}
+
+		/*
+		 * Sanity check that entry processing is not going backwards.
+		 * This will happen only if hypervisor is tricking us.
+		 */
+		if (data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry) {
+			WARN(1, "SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
+			     end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry);
+			ret = 1;
+			goto out;
+		}
+	}
+
+out:
+	return ret;
+}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index eabb8dd5be5b..48440933bde2 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,38 +655,6 @@ static u64 __init get_jump_table_addr(void)
 	return ret;
 }
 
-static void pvalidate_pages(struct snp_psc_desc *desc)
-{
-	struct psc_entry *e;
-	unsigned long vaddr;
-	unsigned int size;
-	unsigned int i;
-	bool validate;
-	int rc;
-
-	for (i = 0; i <= desc->hdr.end_entry; i++) {
-		e = &desc->entries[i];
-
-		vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
-		size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
-		validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
-
-		rc = pvalidate(vaddr, size, validate);
-		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
-			unsigned long vaddr_end = vaddr + PMD_PAGE_SIZE;
-
-			for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
-				rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
-				if (rc)
-					break;
-			}
-		}
-
-		if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
-			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-	}
-}
-
 static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
 				  unsigned int npages, enum psc_op op)
 {
@@ -782,72 +750,6 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
 		WARN(1, "invalid memory op %d\n", op);
 }
 
-static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
-{
-	int cur_entry, end_entry, ret = 0;
-	struct snp_psc_desc *data;
-	struct es_em_ctxt ctxt;
-
-	vc_ghcb_invalidate(ghcb);
-
-	/* Copy the input desc into GHCB shared buffer */
-	data = (struct snp_psc_desc *)ghcb->shared_buffer;
-	memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
-
-	/*
-	 * As per the GHCB specification, the hypervisor can resume the guest
-	 * before processing all the entries. Check whether all the entries
-	 * are processed. If not, then keep retrying. Note, the hypervisor
-	 * will update the data memory directly to indicate the status, so
-	 * reference the data->hdr everywhere.
-	 *
-	 * The strategy here is to wait for the hypervisor to change the page
-	 * state in the RMP table before guest accesses the memory pages. If the
-	 * page state change was not successful, then later memory access will
-	 * result in a crash.
-	 */
-	cur_entry = data->hdr.cur_entry;
-	end_entry = data->hdr.end_entry;
-
-	while (data->hdr.cur_entry <= data->hdr.end_entry) {
-		ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
-
-		/* This will advance the shared buffer data points to. */
-		ret = sev_es_ghcb_hv_call(ghcb, true, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
-
-		/*
-		 * Page State Change VMGEXIT can pass error code through
-		 * exit_info_2.
-		 */
-		if (WARN(ret || ghcb->save.sw_exit_info_2,
-			 "SNP: PSC failed ret=%d exit_info_2=%llx\n",
-			 ret, ghcb->save.sw_exit_info_2)) {
-			ret = 1;
-			goto out;
-		}
-
-		/* Verify that reserved bit is not set */
-		if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
-			ret = 1;
-			goto out;
-		}
-
-		/*
-		 * Sanity check that entry processing is not going backwards.
-		 * This will happen only if hypervisor is tricking us.
-		 */
-		if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
-"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
-			 end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
-			ret = 1;
-			goto out;
-		}
-	}
-
-out:
-	return ret;
-}
-
 static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
 				       unsigned long vaddr_end, int op)
 {
@@ -952,6 +854,20 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
 }
 
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long vaddr;
+	unsigned int npages;
+
+	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+		return;
+
+	vaddr = (unsigned long)__va(start);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+}
+
 static int snp_set_vmsa(void *va, bool vmsa)
 {
 	u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..b86ad6a8ddf5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -9,6 +9,7 @@
 #include <asm/setup.h>
 #include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
+#include <asm/sev.h>
 
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -66,6 +67,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
 			tdx_accept_memory(range_start * PMD_SIZE,
 					  range_end * PMD_SIZE);
+		} else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+			snp_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
 		} else {
 			panic("Cannot accept memory: unknown platform\n");
 		}
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 200+ messages in thread
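
For readers tracing the arithmetic in the accept_memory() hunk above: each
bit in the unaccepted memory bitmap stands for one 2 MiB (PMD_SIZE) chunk,
so multiplying a bit index by PMD_SIZE recovers the physical address of the
chunk it covers. A minimal userspace sketch of that mapping (simplified
constants and a made-up bit range, not the kernel code):

#include <stdio.h>

#define PMD_SIZE	(2UL << 20)	/* 2 MiB, one bitmap bit */

int main(void)
{
	/* hypothetical bit range, as a bitmap walk might produce */
	unsigned long range_start = 16, range_end = 48;

	/* bit N covers [N * PMD_SIZE, (N + 1) * PMD_SIZE) */
	printf("accept [%#lx, %#lx)\n",
	       range_start * PMD_SIZE, range_end * PMD_SIZE);
	return 0;
}

This prints "accept [0x2000000, 0x6000000)", i.e. the 32 MiB..96 MiB
physical range.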

* Re: [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages
  2022-09-27 17:04   ` [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages Tom Lendacky
@ 2022-09-27 17:10     ` Dave Hansen
  2022-09-27 20:45       ` Tom Lendacky
  2022-09-27 19:04     ` Dionna Amalie Glaze
  1 sibling, 1 reply; 200+ messages in thread
From: Dave Hansen @ 2022-09-27 17:10 UTC (permalink / raw)
  To: Tom Lendacky, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 9/27/22 10:04, Tom Lendacky wrote:
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -649,7 +649,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
>  	int rc;
>  
>  	vaddr = vaddr & PAGE_MASK;
> -	vaddr_end = vaddr + (npages << PAGE_SHIFT);
> +	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);

Could we please just fix the fragile typing that cascaded down to this
point?

Shouldn't 'npages' in this interface be a long?

> struct x86_guest {
>         void (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
>         bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
>         bool (*enc_tlb_flush_required)(bool enc);
>         bool (*enc_cache_flush_required)(void);
> };

^ permalink raw reply	[flat|nested] 200+ messages in thread
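
To make the overflow concrete (a standalone sketch, not kernel code; it
assumes x86-64, where unsigned long is 64-bit): with 'npages' declared as
unsigned int, the shift in pvalidate_pages() is evaluated in 32-bit
arithmetic before any widening happens, so a count that is a multiple of
0x100000 pages (4 GiB of 4k pages) wraps to zero:

#include <stdio.h>

#define PAGE_SHIFT	12

int main(void)
{
	unsigned int npages = 0x100000;	/* 4 GiB worth of 4k pages */

	/* wraps: the shift happens in 32 bits, then widens */
	unsigned long bad = npages << PAGE_SHIFT;

	/* the fix: widen before shifting */
	unsigned long good = (unsigned long)npages << PAGE_SHIFT;

	printf("bad=%#lx good=%#lx\n", bad, good);	/* bad=0 good=0x100000000 */
	return 0;
}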

* Re: [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages
  2022-09-27 17:04   ` [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages Tom Lendacky
  2022-09-27 17:10     ` Dave Hansen
@ 2022-09-27 19:04     ` Dionna Amalie Glaze
  1 sibling, 0 replies; 200+ messages in thread
From: Dionna Amalie Glaze @ 2022-09-27 19:04 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: LKML, the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Kirill A. Shutemov, H. Peter Anvin,
	Michael Roth, Joerg Roedel, Andy Lutomirski, Peter Zijlstra

>
> When calculating an end address based on an unsigned int number of pages,
> the number of pages must be cast to an unsigned long so that any value
> greater than or equal to 0x100000 does not result in zero after the shift.
>
> Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")

Tested-by: Dionna Glaze <dionnaglaze@google.com>

-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages
  2022-09-27 17:10     ` Dave Hansen
@ 2022-09-27 20:45       ` Tom Lendacky
  0 siblings, 0 replies; 200+ messages in thread
From: Tom Lendacky @ 2022-09-27 20:45 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Kirill A. Shutemov, H. Peter Anvin, Michael Roth, Joerg Roedel,
	Andy Lutomirski, Peter Zijlstra

On 9/27/22 12:10, Dave Hansen wrote:
> On 9/27/22 10:04, Tom Lendacky wrote:
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -649,7 +649,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool valid
>>   	int rc;
>>   
>>   	vaddr = vaddr & PAGE_MASK;
>> -	vaddr_end = vaddr + (npages << PAGE_SHIFT);
>> +	vaddr_end = vaddr + ((unsigned long)npages << PAGE_SHIFT);
> 
> Could we please just fix the fragile typing that cascaded down to this
> point?
> 
> Shouldn't 'npages' in this interface be a long?

I'll take a look at that.

Thanks,
Tom

> 
>> struct x86_guest {
>>          void (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
>>          bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
>>          bool (*enc_tlb_flush_required)(bool enc);
>>          bool (*enc_cache_flush_required)(void);
>> };

^ permalink raw reply	[flat|nested] 200+ messages in thread
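
What Dave is asking for would look roughly like the following (a
hypothetical sketch, not a posted patch; the exact prototypes were still
under discussion at this point):

/* kernel-style declarations; 'bool' comes from <linux/types.h> */
struct x86_guest {
	void (*enc_status_change_prepare)(unsigned long vaddr,
					  unsigned long npages, bool enc);
	bool (*enc_status_change_finish)(unsigned long vaddr,
					 unsigned long npages, bool enc);
	bool (*enc_tlb_flush_required)(bool enc);
	bool (*enc_cache_flush_required)(void);
};

With a 64-bit page count end to end, callers never shift a 32-bit value in
the first place and the cast becomes unnecessary.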

* Re: [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap
  2022-07-26  9:07   ` Borislav Petkov
@ 2022-11-30  1:28     ` Kirill A. Shutemov
  2022-12-01  9:37       ` Mike Rapoport
  0 siblings, 1 reply; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-11-30  1:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On Tue, Jul 26, 2022 at 11:07:14AM +0200, Borislav Petkov wrote:
> On Tue, Jun 14, 2022 at 03:02:25PM +0300, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index f267205f2d5a..22d1fe48dcba 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
> >  	int i;
> >  	u64 end;
> >  
> > +	/* Mark unaccepted memory bitmap reserved */
> > +	if (boot_params.unaccepted_memory) {
> > +		unsigned long size;
> > +
> > +		/* One bit per 2MB */
> > +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > +				    PMD_SIZE * BITS_PER_BYTE);
> > +		memblock_reserve(boot_params.unaccepted_memory, size);
> > +	}
> > +
> 
> Hmm, I don't like how this is dropped right in the middle of an unrelated
> function.
> 
> You're adding arch/x86/mm/unaccepted_memory.c later. Why don't you put
> that chunk in a function there which is called by early_reserve_memory()
> which does exactly what you want - reserve memory early, before memblock
> allocations?

early_reserve_memory() is specifically called before e820__memory_setup()
(see the comment in setup_arch()), so e820_table is not yet finalized, and
we need it to get the correct RAM size from e820__end_of_ram_pfn().

I guess we can hide the chunk in a function in unaccepted_memory.c and
call it from here, but it would require #ifdeffery in a header file as the
.c is only compiled for CONFIG_UNACCEPTED_MEMORY=y.

Looks like overkill to me, no?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 200+ messages in thread
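
The #ifdeffery Kirill is referring to would amount to something like this
(a hypothetical sketch, not code from the thread; the helper name is made
up):

/* arch/x86/include/asm/unaccepted_memory.h (hypothetical) */
#ifdef CONFIG_UNACCEPTED_MEMORY
void __init e820__reserve_unaccepted_bitmap(void);
#else
static inline void e820__reserve_unaccepted_bitmap(void) { }
#endif

The =n stub keeps the caller in e820.c free of #ifdefs, at the cost of one
more layer of indirection around a single memblock_reserve() call.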

* Re: [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap
  2022-11-30  1:28     ` Kirill A. Shutemov
@ 2022-12-01  9:37       ` Mike Rapoport
  2022-12-01 13:47         ` Kirill A. Shutemov
  0 siblings, 1 reply; 200+ messages in thread
From: Mike Rapoport @ 2022-12-01  9:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On Wed, Nov 30, 2022 at 04:28:40AM +0300, Kirill A. Shutemov wrote:
> On Tue, Jul 26, 2022 at 11:07:14AM +0200, Borislav Petkov wrote:
> > On Tue, Jun 14, 2022 at 03:02:25PM +0300, Kirill A. Shutemov wrote:
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index f267205f2d5a..22d1fe48dcba 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
> > >  	int i;
> > >  	u64 end;
> > >  
> > > +	/* Mark unaccepted memory bitmap reserved */
> > > +	if (boot_params.unaccepted_memory) {
> > > +		unsigned long size;
> > > +
> > > +		/* One bit per 2MB */
> > > +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > > +				    PMD_SIZE * BITS_PER_BYTE);
> > > +		memblock_reserve(boot_params.unaccepted_memory, size);
> > > +	}
> > > +
> > 
> > Hmm, I don't like how this is dropped right in the middle of an unrelated
> > function.
> > 
> > You're adding arch/x86/mm/unaccepted_memory.c later. Why don't you put
> > that chunk in a function there which is called by early_reserve_memory()
> > which does exactly what you want - reserve memory early, before memblock
> > allocations?
> 
> early_reserve_memory() is specifically called before e820__memory_setup()
> (see the comment in setup_arch()), so e820_table is not yet finalized, and
> we need it to get the correct RAM size from e820__end_of_ram_pfn().
> 
> I guess we can hide the chunk in a function in unaccepted_memory.c and
> call it from here, but it would require #ifdeffery in a header file as the
> .c is only compiled for CONFIG_UNACCEPTED_MEMORY=y.
> 
> Looks like overkill to me, no?

Agree. Can we just extend the comment to explain why we reserve the bitmap
at e820__memblock_setup() rather than in early_reserve_memory(), pretty
much with the explanation above?
 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap
  2022-12-01  9:37       ` Mike Rapoport
@ 2022-12-01 13:47         ` Kirill A. Shutemov
  0 siblings, 0 replies; 200+ messages in thread
From: Kirill A. Shutemov @ 2022-12-01 13:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, David Hildenbrand, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On Thu, Dec 01, 2022 at 11:37:10AM +0200, Mike Rapoport wrote:
> On Wed, Nov 30, 2022 at 04:28:40AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jul 26, 2022 at 11:07:14AM +0200, Borislav Petkov wrote:
> > > On Tue, Jun 14, 2022 at 03:02:25PM +0300, Kirill A. Shutemov wrote:
> > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > index f267205f2d5a..22d1fe48dcba 100644
> > > > --- a/arch/x86/kernel/e820.c
> > > > +++ b/arch/x86/kernel/e820.c
> > > > @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
> > > >  	int i;
> > > >  	u64 end;
> > > >  
> > > > +	/* Mark unaccepted memory bitmap reserved */
> > > > +	if (boot_params.unaccepted_memory) {
> > > > +		unsigned long size;
> > > > +
> > > > +		/* One bit per 2MB */
> > > > +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > > > +				    PMD_SIZE * BITS_PER_BYTE);
> > > > +		memblock_reserve(boot_params.unaccepted_memory, size);
> > > > +	}
> > > > +
> > > 
> > > Hmm, I don't like how this is dropped right in the middle of an unrelated
> > > function.
> > > 
> > > You're adding arch/x86/mm/unaccepted_memory.c later. Why don't you put
> > > that chunk in a function there which is called by early_reserve_memory()
> > > which does exactly what you want - reserve memory early, before memblock
> > > allocations?
> > 
> > early_reserve_memory() is specifically called before e820__memory_setup()
> > (see the comment in setup_arch()), so e820_table is not yet finalized, and
> > we need it to get the correct RAM size from e820__end_of_ram_pfn().
> > 
> > I guess we can hide the chunk in a function in unaccepted_memory.c and
> > call it from here, but it would require #ifdeffery in a header file as the
> > .c is only compiled for CONFIG_UNACCEPTED_MEMORY=y.
> > 
> > Looks like overkill to me, no?
> 
> Agree. Can we just extend the comment to explain why we reserve the bitmap
> at e820__memblock_setup() rather than in early_reserve_memory(), pretty
> much with the explanation above?

Okay, I will do this:

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 49b5164a4cba..62068956bb76 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,7 +1316,14 @@ void __init e820__memblock_setup(void)
 	int i;
 	u64 end;
 
-	/* Mark unaccepted memory bitmap reserved */
+	/*
+	 * Mark unaccepted memory bitmap reserved.
+	 *
+	 * This kind of reservation is usually done from early_reserve_memory(),
+	 * but early_reserve_memory() is called before e820__memory_setup(), so
+	 * e820_table is not finalized there and e820__end_of_ram_pfn() cannot
+	 * be used to get the correct RAM size.
+	 */
 	if (boot_params.unaccepted_memory) {
 		unsigned long size;
 
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 200+ messages in thread
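
As a sanity check on the sizing in the hunk above (a back-of-the-envelope
sketch that mirrors the kernel's DIV_ROUND_UP(), not kernel code): one bit
per 2 MiB means 64 GiB of RAM is covered by exactly one 4k page of bitmap:

#include <stdio.h>

#define PMD_SIZE	(2UL << 20)	/* 2 MiB per bitmap bit */
#define BITS_PER_BYTE	8UL
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long ram = 64UL << 30;	/* 64 GiB of RAM */
	unsigned long size = DIV_ROUND_UP(ram, PMD_SIZE * BITS_PER_BYTE);

	printf("%lu bytes\n", size);	/* 4096: one 4k page */
	return 0;
}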

end of thread, other threads:[~2022-12-01 13:48 UTC | newest]

Thread overview: 200+ messages
2022-06-14 12:02 [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
2022-06-14 12:02 ` [PATCHv7 01/14] x86/boot: Centralize __pa()/__va() definitions Kirill A. Shutemov
2022-06-23 17:37   ` Dave Hansen
2022-06-14 12:02 ` [PATCHv7 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
2022-06-14 12:57   ` Gupta, Pankaj
2022-06-17 19:28   ` Tom Lendacky
2022-06-17 20:53     ` Tom Lendacky
2022-07-21 15:14   ` Borislav Petkov
2022-07-21 15:49     ` Dave Hansen
2022-07-22 19:18       ` Borislav Petkov
2022-07-22 19:30         ` Dave Hansen
2022-07-25 12:23           ` Borislav Petkov
2022-07-25 12:38             ` David Hildenbrand
2022-07-25 12:53               ` Borislav Petkov
2022-07-26 14:30                 ` David Hildenbrand
2022-07-25 13:00             ` Mike Rapoport
2022-07-25 13:05               ` Borislav Petkov
2022-08-05 11:49   ` Vlastimil Babka
2022-08-05 12:09     ` David Hildenbrand
2022-08-05 13:38       ` Vlastimil Babka
2022-08-05 14:22         ` David Hildenbrand
2022-08-05 14:53           ` Dave Hansen
2022-08-05 14:41         ` Dave Hansen
2022-08-05 18:17           ` Vlastimil Babka
2022-08-08 15:55             ` Dave Hansen
2022-08-10 14:19     ` Mel Gorman
2022-08-15 21:08       ` Dionna Amalie Glaze
2022-08-15 22:02         ` Tom Lendacky
2022-08-29 16:02           ` Dionna Amalie Glaze
2022-08-29 16:19             ` Dave Hansen
2022-09-06 17:50               ` Dionna Amalie Glaze
2022-09-08 12:11                 ` Mike Rapoport
2022-09-08 16:23                   ` Dionna Amalie Glaze
2022-09-08 19:28                     ` Mike Rapoport
2022-09-22 14:31                       ` Tom Lendacky
2022-09-24  1:03                         ` Kirill A. Shutemov
2022-09-24  9:36                           ` Mike Rapoport
2022-09-26 12:10                           ` Kirill A. Shutemov
2022-09-26 13:38                             ` Tom Lendacky
2022-09-26 15:42                               ` Kirill A. Shutemov
2022-09-26 15:42                               ` Tom Lendacky
2022-06-14 12:02 ` [PATCHv7 03/14] mm: Report unaccepted memory in meminfo Kirill A. Shutemov
2022-07-26 14:33   ` David Hildenbrand
2022-06-14 12:02 ` [PATCHv7 04/14] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
2022-07-25 13:02   ` Borislav Petkov
2022-06-14 12:02 ` [PATCHv7 05/14] x86/boot: Add infrastructure required for unaccepted memory support Kirill A. Shutemov
2022-06-15 10:19   ` Peter Zijlstra
2022-06-15 15:05     ` Kirill A. Shutemov
2022-07-17 17:16       ` Borislav Petkov
2022-07-25 21:33   ` Borislav Petkov
2022-06-14 12:02 ` [PATCHv7 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
2022-06-22 19:58   ` Dave Hansen
2022-07-26  8:35   ` Borislav Petkov
2022-06-14 12:02 ` [PATCHv7 07/14] x86/boot/compressed: Handle " Kirill A. Shutemov
2022-06-14 12:02 ` [PATCHv7 08/14] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
2022-07-26  9:07   ` Borislav Petkov
2022-11-30  1:28     ` Kirill A. Shutemov
2022-12-01  9:37       ` Mike Rapoport
2022-12-01 13:47         ` Kirill A. Shutemov
2022-06-14 12:02 ` [PATCHv7 09/14] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
2022-06-14 12:02 ` [PATCHv7 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
2022-06-23 17:19   ` Dave Hansen
2022-07-26 10:21   ` Borislav Petkov
2022-08-02 23:46     ` Dave Hansen
2022-08-03 14:02       ` Dave Hansen
2022-08-11 11:26         ` Borislav Petkov
2022-08-13 16:11           ` Andy Lutomirski
2022-08-13 21:13             ` Kirill A. Shutemov
2022-08-13 16:04         ` Andy Lutomirski
2022-08-13 20:58           ` Kirill A. Shutemov
2022-07-26 17:25   ` Borislav Petkov
2022-07-26 17:46     ` Dave Hansen
2022-07-26 20:17   ` Andy Lutomirski
2022-08-09 11:38     ` Kirill A. Shutemov
2022-08-13 16:03       ` Andy Lutomirski
2022-08-13 21:02         ` Kirill A. Shutemov
2022-06-14 12:02 ` [PATCHv7 11/14] x86: Disable kexec if system has " Kirill A. Shutemov
2022-06-23 17:23   ` Dave Hansen
2022-06-23 21:48     ` Eric W. Biederman
2022-06-24  2:00       ` Kirill A. Shutemov
2022-06-28 23:51         ` Kirill A. Shutemov
2022-06-29  0:10           ` Dave Hansen
2022-06-29  0:59             ` Kirill A. Shutemov
2022-07-04  7:18               ` Dave Young
2022-06-14 12:02 ` [PATCHv7 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
2022-06-23 17:25   ` Dave Hansen
2022-06-14 12:02 ` [PATCHv7 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
2022-06-23 17:31   ` Dave Hansen
2022-07-26 10:58   ` Borislav Petkov
2022-06-14 12:02 ` [PATCHv7 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
2022-06-24 16:22   ` Dave Hansen
2022-06-27 10:42     ` Kirill A. Shutemov
2022-07-26 14:51   ` Borislav Petkov
2022-08-09 11:45     ` Kirill A. Shutemov
2022-08-10 10:27       ` Borislav Petkov
2022-06-24 16:37 ` [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted memory Peter Gonda
2022-06-24 16:57   ` Dave Hansen
2022-06-24 17:06     ` Marc Orr
2022-06-24 17:09       ` Dave Hansen
2022-06-24 17:15         ` Peter Gonda
2022-06-24 17:19         ` Marc Orr
2022-06-24 17:21           ` Peter Gonda
2022-06-24 17:47           ` Dave Hansen
2022-06-24 18:10             ` Peter Gonda
2022-06-24 18:13               ` Dave Hansen
2022-06-24 17:40   ` Michael Roth
2022-06-24 17:58     ` Michael Roth
2022-06-24 18:05     ` Peter Gonda
2022-06-27 11:30   ` Kirill A. Shutemov
2022-06-27 11:54     ` Ard Biesheuvel
2022-06-27 12:22       ` Kirill A. Shutemov
2022-06-27 16:17         ` Peter Gonda
2022-06-27 16:33           ` Ard Biesheuvel
2022-06-27 22:38             ` Kirill A. Shutemov
2022-06-28 17:17               ` Ard Biesheuvel
2022-07-18 17:21                 ` Kirill A. Shutemov
2022-07-18 23:32                   ` Dionna Amalie Glaze
2022-07-19  0:31                     ` Dionna Amalie Glaze
2022-07-19 18:29                       ` Dionna Amalie Glaze
2022-07-19 19:13                         ` Borislav Petkov
2022-07-19 20:45                           ` Ard Biesheuvel
2022-07-19 21:23                             ` Borislav Petkov
2022-07-19 21:35                               ` Dave Hansen
2022-07-19 21:50                                 ` Borislav Petkov
2022-07-19 22:01                                   ` Kirill A. Shutemov
2022-07-19 22:02                                   ` Dave Hansen
2022-07-19 22:08                                     ` Tom Lendacky
2022-07-20  0:26                                     ` Marc Orr
2022-07-20  5:44                                       ` Borislav Petkov
2022-07-20 17:03                                         ` Marc Orr
2022-07-22 15:07                                           ` Borislav Petkov
2022-07-21 17:12                                       ` Dave Hansen
2022-07-23 11:14                                         ` Ard Biesheuvel
2022-07-28 22:01                                           ` Dionna Amalie Glaze
2022-08-09 11:14                                           ` Kirill A. Shutemov
2022-08-09 11:36                                             ` Ard Biesheuvel
2022-08-09 11:54                                               ` Kirill A. Shutemov
2022-08-09 21:09                                                 ` Dionna Amalie Glaze
2022-07-19  2:48                     ` Yao, Jiewen
2022-07-29 14:01 ` [PATCH v1 0/2] Provide SEV-SNP " Tom Lendacky
2022-07-29 14:01   ` [PATCH v1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
2022-07-29 14:18     ` Dave Hansen
2022-07-29 14:25       ` Tom Lendacky
2022-07-29 19:08         ` Dave Hansen
2022-07-29 19:22           ` Tom Lendacky
2022-07-29 19:28             ` Dave Hansen
2022-07-29 20:12               ` Tom Lendacky
2022-08-03 18:11                 ` [PATCH v1.1 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
2022-08-03 18:11                   ` [PATCH v1.1 1/2] x86/sev: Use per-CPU PSC structure in prep for unaccepted memory support Tom Lendacky
2022-08-03 18:17                     ` Dave Hansen
2022-08-03 18:21                       ` Tom Lendacky
2022-08-03 18:24                         ` Dave Hansen
2022-08-03 21:03                           ` Tom Lendacky
2022-08-03 21:18                             ` Dave Hansen
2022-08-03 21:34                               ` Tom Lendacky
2022-08-03 21:48                                 ` Dave Hansen
2022-08-03 22:17                                   ` Tom Lendacky
2022-08-03 18:18                     ` Tom Lendacky
2022-08-03 18:11                   ` [PATCH v1.1 2/2] x86/sev: Add SNP-specific " Tom Lendacky
2022-07-29 14:01   ` [PATCH v1 " Tom Lendacky
2022-08-23  0:24     ` Dionna Amalie Glaze
2022-08-23 14:28       ` Tom Lendacky
2022-08-23 23:28     ` Dionna Amalie Glaze
2022-08-08 17:16 ` [PATCH v2 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
2022-08-08 17:16   ` [PATCH v2 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
2022-08-08 21:43     ` Dave Hansen
2022-08-08 22:18       ` Tom Lendacky
2022-08-08 22:33         ` Dave Hansen
2022-08-08 22:35           ` Tom Lendacky
2022-08-12 13:03     ` Borislav Petkov
2022-08-12 14:11       ` Tom Lendacky
2022-08-12 14:33         ` Borislav Petkov
2022-08-12 14:51           ` Tom Lendacky
2022-08-13 19:40             ` Borislav Petkov
2022-08-14 13:36               ` Tom Lendacky
2022-08-08 17:16   ` [PATCH v2 2/2] x86/sev: Add SNP-specific " Tom Lendacky
2022-08-15 15:57 ` [PATCH v3 0/2] Provide SEV-SNP support for unaccepted memory Tom Lendacky
2022-08-15 15:57   ` [PATCH v3 1/2] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
2022-08-17 16:08     ` Borislav Petkov
2022-08-17 21:17       ` Tom Lendacky
2022-08-15 15:57   ` [PATCH v3 2/2] x86/sev: Add SNP-specific " Tom Lendacky
2022-08-18 13:39     ` Borislav Petkov
2022-08-25 14:23 ` [PATCH v4 0/4] Provide SEV-SNP support for unaccepted memory Tom Lendacky
2022-08-25 14:23   ` [PATCH v4 1/4] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
2022-09-20 16:15     ` Borislav Petkov
2022-08-25 14:23   ` [PATCH v4 2/4] x86/sev: Allow for use of the early boot GHCB for PSC requests Tom Lendacky
2022-08-25 14:23   ` [PATCH v4 3/4] x86/sev: Use large PSC requests if applicable Tom Lendacky
2022-08-25 14:23   ` [PATCH v4 4/4] x86/sev: Add SNP-specific unaccepted memory support Tom Lendacky
2022-08-25 22:10     ` Dionna Amalie Glaze
2022-08-26 21:29       ` Tom Lendacky
2022-09-27 17:04 ` [PATCH v5 0/6] Provide SEV-SNP support for unaccepted memory Tom Lendacky
2022-09-27 17:04   ` [PATCH v5 1/6] x86/sev: Fix calculation of end address based on number of pages Tom Lendacky
2022-09-27 17:10     ` Dave Hansen
2022-09-27 20:45       ` Tom Lendacky
2022-09-27 19:04     ` Dionna Amalie Glaze
2022-09-27 17:04   ` [PATCH v5 2/6] " Tom Lendacky
2022-09-27 17:04   ` [PATCH v5 3/6] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support Tom Lendacky
2022-09-27 17:04   ` [PATCH v5 4/6] x86/sev: Allow for use of the early boot GHCB for PSC requests Tom Lendacky
2022-09-27 17:04   ` [PATCH v5 5/6] x86/sev: Use large PSC requests if applicable Tom Lendacky
2022-09-27 17:04   ` [PATCH v5 6/6] x86/sev: Add SNP-specific unaccepted memory support Tom Lendacky
