* [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory
@ 2022-04-05 23:43 Kirill A. Shutemov
  2022-04-05 23:43 ` [PATCHv4 1/8] mm: Add " Kirill A. Shutemov
                   ` (8 more replies)
  0 siblings, 9 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed. This lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. That
fragments the table, and the e820 table has a limited number of
entries.

Another option is to mark such memory as usable in e820 and track
whether the range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB of the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
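
The sizing arithmetic above can be checked with a small userspace
sketch. This is not code from the series: the helper name
unaccepted_bitmap_bytes() and the UNIT_SIZE constant are made up for
illustration, and the only assumption is the one stated above, that one
bit covers 2MiB and the bitmap starts at physical address 0.

```c
#include <stdint.h>

/* Illustrative constants, not taken from the patches. */
#define UNIT_SIZE	(2ULL << 20)	/* one bitmap bit covers 2MiB */
#define BITS_PER_BYTE	8ULL

/*
 * Bytes of bitmap needed to cover physical addresses [0, max_addr),
 * rounded up to a whole byte.
 */
static uint64_t unaccepted_bitmap_bytes(uint64_t max_addr)
{
	/* One byte of bitmap covers 8 units, i.e. 16MiB of address space. */
	uint64_t covered_per_byte = UNIT_SIZE * BITS_PER_BYTE;

	return (max_addr + covered_per_byte - 1) / covered_per_byte;
}
```

With these definitions a 64GiB address space needs 4096 bytes (one 4k
page) and a 4PiB address space needs 256MiB, matching the figures
quoted above.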

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for a 4G TDX VM and ~4x faster for a 64G one.

Patches 1-6 are generic and don't have any dependencies on TDX. They
should serve AMD SEV needs as well. TDX-specific code is isolated in
patch 7 (x86/tdx: Unaccepted memory support). That patch requires the
core TDX patchset, which is currently under review.

v4:
 - PageBuddyUnaccepted() -> PageUnaccepted();
 - Use separate page_type, not shared with offline;
 - Rework interface between core-mm and arch code;
 - Adjust commit messages;
 - Ack from Mike;

Kirill A. Shutemov (8):
  mm: Add support for unaccepted memory
  efi/x86: Get full memory map in allocate_e820()
  efi/x86: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  x86/mm: Reserve unaccepted memory bitmap
  x86/mm: Provide helpers for unaccepted memory
  x86/tdx: Unaccepted memory support
  mm/vmstat: Add counter for memory accepting

 Documentation/x86/zero-page.rst              |  1 +
 arch/x86/Kconfig                             |  1 +
 arch/x86/boot/compressed/Makefile            |  1 +
 arch/x86/boot/compressed/bitmap.c            | 86 +++++++++++++++++++
 arch/x86/boot/compressed/kaslr.c             | 14 +++-
 arch/x86/boot/compressed/misc.c              | 11 +++
 arch/x86/boot/compressed/tdx.c               | 41 +++++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 88 ++++++++++++++++++++
 arch/x86/coco/tdx/tdx.c                      | 29 +++++--
 arch/x86/include/asm/page.h                  |  5 ++
 arch/x86/include/asm/shared/tdx.h            | 20 +++++
 arch/x86/include/asm/tdx.h                   | 19 -----
 arch/x86/include/asm/unaccepted_memory.h     | 15 ++++
 arch/x86/include/uapi/asm/bootparam.h        |  3 +-
 arch/x86/kernel/e820.c                       | 10 +++
 arch/x86/mm/Makefile                         |  2 +
 arch/x86/mm/unaccepted_memory.c              | 58 +++++++++++++
 drivers/firmware/efi/Kconfig                 | 15 ++++
 drivers/firmware/efi/efi.c                   |  1 +
 drivers/firmware/efi/libstub/x86-stub.c      | 88 ++++++++++++++++----
 include/linux/efi.h                          |  3 +-
 include/linux/page-flags.h                   | 24 ++++++
 include/linux/vm_event_item.h                |  3 +
 mm/internal.h                                | 11 +++
 mm/memblock.c                                |  9 ++
 mm/page_alloc.c                              | 57 ++++++++++++-
 mm/vmstat.c                                  |  3 +
 27 files changed, 569 insertions(+), 49 deletions(-)
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/unaccepted_memory.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 arch/x86/mm/unaccepted_memory.c

-- 
2.35.1



* [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-08 18:55   ` Dave Hansen
                     ` (2 more replies)
  2022-04-05 23:43 ` [PATCHv4 2/8] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Mike Rapoport

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual Machine
platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed. This lowers boot time and reduces
memory overhead.

Support of such memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

Memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
PageUnaccepted() is used to indicate that the page requires acceptance.
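
The flag handling described here can be modelled in userspace. The
sketch below is a toy, not the kernel code: struct toy_block and the
toy_*() helpers are hypothetical names, and a counter stands in for the
real acceptance call. It only illustrates the three rules the patch
applies: a merged block is unaccepted if either buddy was, a split-off
half inherits the flag, and a block is accepted once when it is handed
out.

```c
#include <stdbool.h>

/* Toy stand-in for a free block in the buddy allocator. */
struct toy_block {
	unsigned int order;	/* block spans 2^order pages */
	bool unaccepted;	/* models PageUnaccepted() on the head page */
};

/* Counts how often the (modelled) acceptance call would run. */
static unsigned int nr_accept_calls;

/* Merging on free: the result is unaccepted if either buddy was. */
static struct toy_block toy_merge(struct toy_block a, struct toy_block b)
{
	struct toy_block merged = { a.order + 1, a.unaccepted || b.unaccepted };

	return merged;
}

/* Splitting on allocation: the split-off half inherits the flag. */
static struct toy_block toy_split(struct toy_block *block)
{
	struct toy_block half;

	block->order--;
	half.order = block->order;
	half.unaccepted = block->unaccepted;
	return half;
}

/* Handing a block to a caller: accept once, then clear the flag. */
static void toy_alloc(struct toy_block *block)
{
	if (block->unaccepted) {
		nr_accept_calls++;	/* stands in for accept_memory() */
		block->unaccepted = false;
	}
}
```

In this model, allocating the same block twice triggers only one
acceptance, while a half that was split off earlier keeps its flag
until it is allocated itself.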

The kernel only needs to accept memory once after boot, so during the
boot and warm-up phase there will be a lot of memory acceptance. After
things have settled down, the only cost of the feature is a couple of
checks for PageUnaccepted() in the allocation and free paths. The check
refers to a hot variable (that also encodes PageBuddy()), so it is cheap
and not visible in profiles.

An architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - memory_is_unaccepted() checks whether anything within the range of
   physical addresses requires acceptance.
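
As an illustration of the contract between core-mm and the architecture,
here is a minimal userspace model of the two helpers on top of a bitmap
with one bit per 2MiB. It is a sketch under stated assumptions, not the
x86 implementation from this series: the toy_ prefix, the 64-bit bitmap
and platform_accept() are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNIT_SHIFT	21			/* 2MiB units */
#define UNIT_SIZE	(1ULL << UNIT_SHIFT)
#define NR_UNITS	64			/* toy model: 128MiB of address space */

static uint64_t unaccepted_bitmap;		/* bit set == unit still unaccepted */
static unsigned int nr_platform_accepts;

/* Hypothetical stand-in for the platform-specific acceptance call. */
static void platform_accept(uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
	nr_platform_accepts++;
}

/* True if any 2MiB unit overlapping [start, end) is still unaccepted. */
static bool toy_memory_is_unaccepted(uint64_t start, uint64_t end)
{
	uint64_t unit;

	for (unit = start >> UNIT_SHIFT;
	     unit < NR_UNITS && (unit << UNIT_SHIFT) < end; unit++) {
		if (unaccepted_bitmap & (1ULL << unit))
			return true;
	}
	return false;
}

/* Accept every still-unaccepted unit overlapping [start, end). */
static void toy_accept_memory(uint64_t start, uint64_t end)
{
	uint64_t unit;

	for (unit = start >> UNIT_SHIFT;
	     unit < NR_UNITS && (unit << UNIT_SHIFT) < end; unit++) {
		if (!(unaccepted_bitmap & (1ULL << unit)))
			continue;
		platform_accept(unit << UNIT_SHIFT, (unit + 1) << UNIT_SHIFT);
		unaccepted_bitmap &= ~(1ULL << unit);
	}
}
```

Because acceptance is tracked per 2MiB unit, a request for a single 4k
page in this model accepts the whole surrounding unit, and a second
request for the same range does no platform work.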

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
---
 include/linux/page-flags.h | 24 ++++++++++++++++
 mm/internal.h              | 11 ++++++++
 mm/memblock.c              |  9 ++++++
 mm/page_alloc.c            | 57 ++++++++++++++++++++++++++++++++++++--
 4 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9d8eeaa67d05..aaaedc111092 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -928,6 +928,7 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unaccepted	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -953,6 +954,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_FALSE(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)		\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -983,6 +996,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+ /*
+  * PageUnaccepted() indicates that the page has to be "accepted" before it can
+  * be used. Page allocator has to call accept_page() before returning the page
+  * to the caller.
+  */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+PAGE_TYPE_OPS(Unaccepted, unaccepted)
+#else
+PAGE_TYPE_OPS_FALSE(Unaccepted)
+#endif
+
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
 extern void page_offline_begin(void);
diff --git a/mm/internal.h b/mm/internal.h
index cf16280ce132..10302fe857c4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -758,4 +758,15 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index e4f03a6e8e56..a1f7f8b304d5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2db95780e003..53f4aa1c92a7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
 
+/*
+ * Check if the page needs to be marked as PageUnaccepted().
+ * Used for the new pages added to the buddy allocator for the first time.
+ */
+#define FPI_UNACCEPTED		((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 	return page_is_buddy(higher_page, higher_buddy, order + 1);
 }
 
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	int i;
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+
+	for (i = 0; i < (1 << order); i++) {
+		if (PageUnaccepted(page + i))
+			__ClearPageUnaccepted(page + i);
+	}
+}
+
+static bool page_is_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+
+	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_pfn;
 	struct page *buddy;
 	bool to_tail;
+	bool unaccepted = PageUnaccepted(page);
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
 			del_page_from_free_list(buddy, zone, order);
+
+		/* Mark page unaccepted if any of merged pages were unaccepted */
+		if (PageUnaccepted(buddy))
+			unaccepted = true;
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_buddy_order(page, order);
 
+	/*
+	 * Check if the page needs to be marked as PageUnaccepted().
+	 * Used for the new pages added to the buddy allocator for the first
+	 * time.
+	 */
+	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
+		unaccepted = page_is_unaccepted(page, order);
+
+	if (unaccepted)
+		__SetPageUnaccepted(page);
+
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
 	else if (is_shuffle_order(order))
@@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
+	    !PageUnaccepted(page))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
 	 */
-	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+	__free_pages_ok(page, order,
+			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
 }
 
 #ifdef CONFIG_NUMA
@@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
+		/* Transfer PageUnaccepted() to the newly split pages */
+		if (PageUnaccepted(page))
+			__SetPageUnaccepted(&page[size]);
+
 		add_to_free_list(&page[size], zone, high, migratetype);
 		set_buddy_order(&page[size], high);
 	}
@@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	kernel_unpoison_pages(page, 1 << order);
 
+	if (PageUnaccepted(page))
+		accept_page(page, order);
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * KASAN unpoisoning and memory initializion code must be
-- 
2.35.1



* [PATCHv4 2/8] efi/x86: Get full memory map in allocate_e820()
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
  2022-04-05 23:43 ` [PATCHv4 1/8] mm: Add " Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-13  9:59   ` Borislav Petkov
  2022-04-05 23:43 ` [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Currently allocate_e820() is only interested in the size of the map and
the size of the memory descriptor, which it uses to determine how many
e820 entries the kernel needs.

UEFI Specification version 2.9 introduces a new memory type --
unaccepted memory. To track unaccepted memory, the kernel needs to
allocate a bitmap. The size of the bitmap depends on the maximum
physical address present in the system. A full memory map is required to
find the maximum address.

Modify allocate_e820() to get a full memory map.

This is preparation for the next patch, which implements handling of
unaccepted memory in the EFI stub.
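
To show why the sizes alone are not enough, the sketch below walks a
simplified memory map to find the maximum address and to detect
unaccepted memory. It is illustrative only: struct mem_desc is a
cut-down, hypothetical stand-in for the real EFI descriptor (which has
more fields and is walked using the firmware-reported descriptor size),
and scan_memory_map() is not a function from this series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SIZE		4096ULL
#define MEM_TYPE_UNACCEPTED	15	/* value this series gives EFI_UNACCEPTED_MEMORY */

/* Simplified descriptor: only the fields the scan needs. */
struct mem_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
};

struct scan_result {
	bool unaccepted_present;	/* any unaccepted descriptor seen */
	uint64_t max_addr;		/* end of the highest descriptor */
};

/* Walk every descriptor; both answers need the full map contents. */
static struct scan_result scan_memory_map(const struct mem_desc *map,
					  size_t nr_desc)
{
	struct scan_result res = { false, 0 };
	size_t i;

	for (i = 0; i < nr_desc; i++) {
		uint64_t end = map[i].phys_addr +
			       map[i].num_pages * EFI_PAGE_SIZE;

		if (map[i].type == MEM_TYPE_UNACCEPTED)
			res.unaccepted_present = true;
		if (end > res.max_addr)
			res.max_addr = end;
	}
	return res;
}
```

The maximum address then determines the bitmap size, and the bitmap
only has to be allocated at all when unaccepted memory is present.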

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/firmware/efi/libstub/x86-stub.c | 28 ++++++++++++-------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 01ddd4502e28..d18cac8ab436 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -569,30 +569,28 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
 }
 
 static efi_status_t allocate_e820(struct boot_params *params,
+				  struct efi_boot_memmap *map,
 				  struct setup_data **e820ext,
 				  u32 *e820ext_size)
 {
-	unsigned long map_size, desc_size, map_key;
 	efi_status_t status;
-	__u32 nr_desc, desc_version;
+	__u32 nr_desc;
 
-	/* Only need the size of the mem map and size of each mem descriptor */
-	map_size = 0;
-	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
-			     &desc_size, &desc_version);
-	if (status != EFI_BUFFER_TOO_SMALL)
-		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
-
-	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
+	status = efi_get_memory_map(map);
+	if (status != EFI_SUCCESS)
+		return status;
 
-	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
-		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+	nr_desc = *map->map_size / *map->desc_size;
+	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+			EFI_MMAP_NR_SLACK_SLOTS;
 
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
 		if (status != EFI_SUCCESS)
-			return status;
+			goto out;
 	}
-
+out:
+	efi_bs_call(free_pool, *map->map);
 	return EFI_SUCCESS;
 }
 
@@ -642,7 +640,7 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
 	priv.boot_params	= boot_params;
 	priv.efi		= &boot_params->efi_info;
 
-	status = allocate_e820(boot_params, &e820ext, &e820ext_size);
+	status = allocate_e820(boot_params, &map, &e820ext, &e820ext_size);
 	if (status != EFI_SUCCESS)
 		return status;
 
-- 
2.35.1



* [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
  2022-04-05 23:43 ` [PATCHv4 1/8] mm: Add " Kirill A. Shutemov
  2022-04-05 23:43 ` [PATCHv4 2/8] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-08 17:26   ` Dave Hansen
  2022-04-15 22:24   ` Borislav Petkov
  2022-04-05 23:43 ` [PATCHv4 4/8] x86/boot/compressed: Handle " Kirill A. Shutemov
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed. This lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. That
fragments the table, and the e820 table has a limited number of
entries.

Another option is to mark such memory as usable in e820 and track
whether the range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB of the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to 2M gets accepted upfront.
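
The alignment rule can be sketched as a pure function that reports what
would be accepted upfront and which bitmap bits would be set. This
mirrors the logic of mark_unaccepted() in the diff below, but it is a
userspace illustration: struct unaccepted_split and split_unaccepted()
are hypothetical names, and nothing is actually accepted.

```c
#include <stdint.h>

#define UNIT_SIZE	(2ULL << 20)	/* 2MiB, the bitmap granularity */

struct unaccepted_split {
	uint64_t accepted_upfront;	/* bytes that must be accepted right away */
	uint64_t first_bit;		/* first bitmap bit to set */
	uint64_t nr_bits;		/* number of bits to set, 0 if none */
};

static struct unaccepted_split split_unaccepted(uint64_t start, uint64_t end)
{
	struct unaccepted_split s = { 0, 0, 0 };
	uint64_t aligned_start, aligned_end;

	/* Small ranges may not contain a full aligned unit: accept them whole. */
	if (end - start < 2 * UNIT_SIZE) {
		s.accepted_upfront = end - start;
		return s;
	}

	/* Trim the unaligned head and tail; both are accepted upfront. */
	aligned_start = (start + UNIT_SIZE - 1) & ~(UNIT_SIZE - 1);
	aligned_end = end & ~(UNIT_SIZE - 1);
	s.accepted_upfront = (aligned_start - start) + (end - aligned_end);

	/* The aligned middle stays unaccepted and is recorded in the bitmap. */
	s.first_bit = aligned_start / UNIT_SIZE;
	s.nr_bits = (aligned_end - aligned_start) / UNIT_SIZE;
	return s;
}
```

For example, a range from 1MiB to 7MiB yields 2MiB accepted upfront and
two bits set starting at bit 1. Note that with the size check written
this way, any range shorter than 4MiB is accepted upfront even if it
happens to be 2MiB-aligned.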

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via boot_params. allocate_e820() allocates the bitmap if
unaccepted memory is present, according to the maximum address in the
memory map.

The same boot_params.unaccepted_memory can be used to pass the bitmap
between two kernels on kexec, but the use-case is not yet implemented.
Make KEXEC and UNACCEPTED_MEMORY mutually exclusive for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/zero-page.rst              |  1 +
 arch/x86/boot/compressed/Makefile            |  1 +
 arch/x86/boot/compressed/bitmap.c            | 24 ++++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 53 +++++++++++++++++
 arch/x86/include/asm/unaccepted_memory.h     | 12 ++++
 arch/x86/include/uapi/asm/bootparam.h        |  3 +-
 drivers/firmware/efi/Kconfig                 | 15 +++++
 drivers/firmware/efi/efi.c                   |  1 +
 drivers/firmware/efi/libstub/x86-stub.c      | 62 +++++++++++++++++++-
 include/linux/efi.h                          |  3 +-
 10 files changed, 172 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/unaccepted_memory.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
index f088f5881666..8e3447a4b373 100644
--- a/Documentation/x86/zero-page.rst
+++ b/Documentation/x86/zero-page.rst
@@ -42,4 +42,5 @@ Offset/Size	Proto	Name			Meaning
 2D0/A00		ALL	e820_table		E820 memory map table
 						(array of struct e820_entry)
 D00/1EC		ALL	eddbuf			EDD data (array of struct edd_info)
+EEC/008		ALL	unaccepted_memory	Bitmap of unaccepted memory (1bit == 2M)
 ===========	=====	=======================	=================================================
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 8fd0e6ae2e1f..09993797efa2 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -102,6 +102,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/unaccepted_memory.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
new file mode 100644
index 000000000000..bf58b259380a
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Taken from lib/bitmap.c */
+
+#include <linux/bitmap.h>
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_set >= 0) {
+		*p |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_LONG;
+		mask_to_set = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+		*p |= mask_to_set;
+	}
+}
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
new file mode 100644
index 000000000000..d363acf59c08
--- /dev/null
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "error.h"
+#include "misc.h"
+
+static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-acceptance call goes here */
+	error("Cannot accept memory");
+}
+
+void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
+{
+	/*
+	 * The accepted memory bitmap only works at PMD_SIZE granularity.
+	 * If a request comes in to mark memory as unaccepted which is not
+	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
+	 * *marked* as unaccepted.
+	 */
+
+	/*
+	 * Accept small regions that might not be able to be represented
+	 * in the bitmap:
+	 */
+	if (end - start < 2 * PMD_SIZE) {
+		__accept_memory(start, end);
+		return;
+	}
+
+	/*
+	 * No matter how the start and end are aligned, at least one unaccepted
+	 * PMD_SIZE area will remain.
+	 */
+
+	/* Immediately accept a <PMD_SIZE piece at the start: */
+	if (start & ~PMD_MASK) {
+		__accept_memory(start, round_up(start, PMD_SIZE));
+		start = round_up(start, PMD_SIZE);
+	}
+
+	/* Immediately accept a <PMD_SIZE piece at the end: */
+	if (end & ~PMD_MASK) {
+		__accept_memory(round_down(end, PMD_SIZE), end);
+		end = round_down(end, PMD_SIZE);
+	}
+
+	/*
+	 * 'start' and 'end' are now both PMD-aligned.
+	 * Record the range as being unaccepted:
+	 */
+	bitmap_set((unsigned long *)params->unaccepted_memory,
+		   start / PMD_SIZE, (end - start) / PMD_SIZE);
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..cbc24040b853
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+#include <linux/types.h>
+
+struct boot_params;
+
+void mark_unaccepted(struct boot_params *params, u64 start, u64 end);
+
+#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index b25d3f82c2f3..16bc686a198d 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -217,7 +217,8 @@ struct boot_params {
 	struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */
 	__u8  _pad8[48];				/* 0xcd0 */
 	struct edd_info eddbuf[EDDMAXNR];		/* 0xd00 */
-	__u8  _pad9[276];				/* 0xeec */
+	__u64 unaccepted_memory;			/* 0xeec */
+	__u8  _pad9[268];				/* 0xef4 */
 } __attribute__((packed));
 
 /**
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 2c3dac5ecb36..b17ceec757d0 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -243,6 +243,21 @@ config EFI_DISABLE_PCI_DMA
 	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
 	  may be used to override this option.
 
+config UNACCEPTED_MEMORY
+	bool
+	depends on EFI_STUB
+	depends on !KEXEC_CORE
+	help
+	   Some Virtual Machine platforms, such as Intel TDX, require
+	   some memory to be "accepted" by the guest before it can be used.
+	   This mechanism helps prevent malicious hosts from making changes
+	   to guest memory.
+
+	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
+
+	   This option adds support for unaccepted memory and makes such memory
+	   usable by kernel.
+
 endmenu
 
 config EFI_EMBEDDED_FIRMWARE
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 5502e176d51b..2c055afb1b11 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -747,6 +747,7 @@ static __initdata char memory_type_name[][13] = {
 	"MMIO Port",
 	"PAL Code",
 	"Persistent",
+	"Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index d18cac8ab436..e7601fd612aa 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -9,12 +9,14 @@
 #include <linux/efi.h>
 #include <linux/pci.h>
 #include <linux/stddef.h>
+#include <linux/bitmap.h>
 
 #include <asm/efi.h>
 #include <asm/e820/types.h>
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/boot.h>
+#include <asm/unaccepted_memory.h>
 
 #include "efistub.h"
 
@@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 			e820_type = E820_TYPE_PMEM;
 			break;
 
+		case EFI_UNACCEPTED_MEMORY:
+			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+				continue;
+			e820_type = E820_TYPE_RAM;
+			mark_unaccepted(params, d->phys_addr,
+					d->phys_addr + PAGE_SIZE * d->num_pages);
+			break;
 		default:
 			continue;
 		}
@@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
 {
 	efi_status_t status;
 	__u32 nr_desc;
+	bool unaccepted_memory_present = false;
+	u64 max_addr = 0;
+	int i;
 
 	status = efi_get_memory_map(map);
 	if (status != EFI_SUCCESS)
@@ -589,9 +601,57 @@ static efi_status_t allocate_e820(struct boot_params *params,
 		if (status != EFI_SUCCESS)
 			goto out;
 	}
+
+	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+		goto out;
+
+	/* Check if there's any unaccepted memory and find the max address */
+	for (i = 0; i < nr_desc; i++) {
+		efi_memory_desc_t *d;
+
+		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
+		if (d->type == EFI_UNACCEPTED_MEMORY)
+			unaccepted_memory_present = true;
+		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
+			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
+	}
+
+	/*
+	 * If unaccepted memory is present allocate a bitmap to track what
+	 * memory has to be accepted before access.
+	 *
+	 * One bit in the bitmap represents 2MiB in the address space:
+	 * A 4k bitmap can track 64GiB of physical address space.
+	 *
+	 * In the worst case scenario -- a huge hole in the middle of the
+	 * address space -- it needs 256MiB to handle 4PiB of the address
+	 * space.
+	 *
+	 * TODO: handle the case where params->unaccepted_memory is already set.
+	 * It's required to deal with kexec.
+	 *
+	 * The bitmap will be populated in setup_e820() according to the memory
+	 * map after efi_exit_boot_services().
+	 */
+	if (unaccepted_memory_present) {
+		unsigned long *unaccepted_memory = NULL;
+		u64 size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+
+		status = efi_allocate_pages(size,
+					    (unsigned long *)&unaccepted_memory,
+					    ULONG_MAX);
+		if (status != EFI_SUCCESS)
+			goto out;
+		memset(unaccepted_memory, 0, size);
+		params->unaccepted_memory = (unsigned long)unaccepted_memory;
+	} else {
+		params->unaccepted_memory = 0;
+	}
+
 out:
 	efi_bs_call(free_pool, *map->map);
-	return EFI_SUCCESS;
+	return status;
+
 }
 
 struct exit_boot_struct {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index ccd4d3f91c98..b0240fdcaf5b 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
 #define EFI_PERSISTENT_MEMORY		14
-#define EFI_MAX_MEMORY_TYPE		15
+#define EFI_UNACCEPTED_MEMORY		15
+#define EFI_MAX_MEMORY_TYPE		16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCHv4 4/8] x86/boot/compressed: Handle unaccepted memory
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2022-04-05 23:43 ` [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-08 17:57   ` Dave Hansen
  2022-04-05 23:43 ` [PATCHv4 5/8] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Firmware is responsible for accepting the memory where the compressed
kernel image and initrd land. The kernel, however, has to accept the
memory for the decompression buffer itself: it does so just before
decompression starts.

KASLR is allowed to use unaccepted memory for the output buffer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/bitmap.c            | 62 ++++++++++++++++++++
 arch/x86/boot/compressed/kaslr.c             | 14 ++++-
 arch/x86/boot/compressed/misc.c              | 11 ++++
 arch/x86/boot/compressed/unaccepted_memory.c | 14 +++++
 arch/x86/include/asm/unaccepted_memory.h     |  2 +
 5 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
index bf58b259380a..ba2de61c0823 100644
--- a/arch/x86/boot/compressed/bitmap.c
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -2,6 +2,48 @@
 /* Taken from lib/string.c */
 
 #include <linux/bitmap.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert, unsigned long le)
+{
+	unsigned long tmp, mask;
+
+	if (unlikely(start >= nbits))
+		return nbits;
+
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
+
+	/* Handle 1st word. */
+	mask = BITMAP_FIRST_WORD_MASK(start);
+	if (le)
+		mask = swab(mask);
+
+	tmp &= mask;
+
+	start = round_down(start, BITS_PER_LONG);
+
+	while (!tmp) {
+		start += BITS_PER_LONG;
+		if (start >= nbits)
+			return nbits;
+
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
+	}
+
+	if (le)
+		tmp = swab(tmp);
+
+	return min(start + __ffs(tmp), nbits);
+}
 
 void __bitmap_set(unsigned long *map, unsigned int start, int len)
 {
@@ -22,3 +64,23 @@ void __bitmap_set(unsigned long *map, unsigned int start, int len)
 		*p |= mask_to_set;
 	}
 }
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 411b268bc0a2..59db90626042 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -725,10 +725,20 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
 		 * but in practice there's firmware where using that memory leads
 		 * to crashes.
 		 *
-		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
+		 * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if
+		 * supported) are guaranteed to be free.
 		 */
-		if (md->type != EFI_CONVENTIONAL_MEMORY)
+
+		switch (md->type) {
+		case EFI_CONVENTIONAL_MEMORY:
+			break;
+		case EFI_UNACCEPTED_MEMORY:
+			if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+				break;
 			continue;
+		default:
+			continue;
+		}
 
 		if (efi_soft_reserve_enabled() &&
 		    (md->attribute & EFI_MEMORY_SP))
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index fa8969fad011..c1d9d71a6615 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -18,6 +18,7 @@
 #include "../string.h"
 #include "../voffset.h"
 #include <asm/bootparam_utils.h>
+#include <asm/unaccepted_memory.h>
 
 /*
  * WARNING!!
@@ -43,6 +44,9 @@
 void *memmove(void *dest, const void *src, size_t n);
 #endif
 
+#undef __pa
+#define __pa(x)	((unsigned long)(x))
+
 /*
  * This is set up by the setup-routine at boot-time
  */
@@ -451,6 +455,13 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
+
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+	    boot_params->unaccepted_memory) {
+		debug_putstr("Accepting memory... ");
+		accept_memory(__pa(output), __pa(output) + needed_size);
+	}
+
 	__decompress(input_data, input_len, NULL, NULL, output, output_len,
 			NULL, error);
 	parse_elf(output);
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index d363acf59c08..3ebab63789bb 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -51,3 +51,17 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
 	bitmap_set((unsigned long *)params->unaccepted_memory,
 		   start / PMD_SIZE, (end - start) / PMD_SIZE);
 }
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *unaccepted_memory;
+	unsigned int rs, re;
+
+	unaccepted_memory = (unsigned long *)boot_params->unaccepted_memory;
+	rs = start / PMD_SIZE;
+	for_each_set_bitrange_from(rs, re, unaccepted_memory,
+				   DIV_ROUND_UP(end, PMD_SIZE)) {
+		__accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
+		bitmap_clear(unaccepted_memory, rs, re - rs);
+	}
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index cbc24040b853..f1f835d3cd78 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -9,4 +9,6 @@ struct boot_params;
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
 
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCHv4 5/8] x86/mm: Reserve unaccepted memory bitmap
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2022-04-05 23:43 ` [PATCHv4 4/8] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-08 18:08   ` Dave Hansen
  2022-04-05 23:43 ` [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Mike Rapoport

A given page of memory can only be accepted once.  The kernel has a need
to accept memory both in the early decompression stage and during normal
runtime.

A bitmap is used to communicate the acceptance state of each page between
the decompression stage and normal runtime.  This eliminates the
possibility of attempting to double-accept a page.

The bitmap is allocated in the EFI stub.  The decompression stage updates
the state of the pages used for the kernel and initrd and hands the bitmap
over to the main kernel image via boot_params.

In the runtime kernel, reserve the bitmap's memory to ensure nothing
overwrites it.
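
As a rough illustration of how large that reservation is, the following
userspace sketch redoes the "one bit per 2MiB" arithmetic from the e820
hunk of this patch. It is not kernel code: the sketch_* names and the
fixed 2MiB constant are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE		(2ULL << 20)	/* 2MiB tracked per bit */
#define SKETCH_BITS_PER_BYTE	8ULL

/* Bytes of bitmap needed to track physical addresses [0, max_addr). */
static uint64_t sketch_bitmap_bytes(uint64_t max_addr)
{
	uint64_t per_byte = SKETCH_PMD_SIZE * SKETCH_BITS_PER_BYTE;

	/* Same result as DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE) */
	return max_addr / per_byte + (max_addr % per_byte != 0);
}

/* Bytes of physical address space that a bitmap of this size can track. */
static uint64_t sketch_bitmap_coverage(uint64_t bitmap_bytes)
{
	uint64_t per_byte = SKETCH_PMD_SIZE * SKETCH_BITS_PER_BYTE;

	/* Refuse sizes whose coverage would not fit in 64 bits */
	assert(bitmap_bytes <= UINT64_MAX / per_byte);
	return bitmap_bytes * per_byte;
}
```

With these helpers, sketch_bitmap_coverage(4096) gives 64GiB and
sketch_bitmap_bytes(1ULL << 52) gives 256MiB, matching the figures quoted
in the cover letter.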

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/kernel/e820.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index f267205f2d5a..22d1fe48dcba 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
 	int i;
 	u64 end;
 
+	/* Mark unaccepted memory bitmap reserved */
+	if (boot_params.unaccepted_memory) {
+		unsigned long size;
+
+		/* One bit per 2MB */
+		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
+				    PMD_SIZE * BITS_PER_BYTE);
+		memblock_reserve(boot_params.unaccepted_memory, size);
+	}
+
 	/*
 	 * The bootstrap memblock region count maximum is 128 entries
 	 * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2022-04-05 23:43 ` [PATCHv4 5/8] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-08 18:15   ` Dave Hansen
  2022-04-08 19:21   ` Dave Hansen
  2022-04-05 23:43 ` [PATCHv4 7/8] x86/tdx: Unaccepted memory support Kirill A. Shutemov
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed.

 - memory_is_unaccepted() checks if anything within the range requires
   acceptance.
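
The intended behaviour of the two helpers can be modelled in userspace.
This is only a sketch under simplifying assumptions: a 64-bit toy bitmap
stands in for the boot_params bitmap, a call counter stands in for the
platform-specific acceptance call, locking is omitted, and every
sketch_* name is hypothetical. Both helpers walk bit indices, so a range
whose start is not 2MiB-aligned still covers its last 2MiB unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE	(2ULL << 20)	/* one bitmap bit covers 2MiB */
#define SKETCH_NBITS	64		/* toy bitmap: 128MiB of addresses */

static uint64_t sketch_bitmap;		/* bit set = still unaccepted */
static unsigned int sketch_accept_calls; /* stand-in for the platform call */

/* True if any 2MiB unit overlapping [start, end) is still unaccepted. */
static bool sketch_memory_is_unaccepted(uint64_t start, uint64_t end)
{
	uint64_t first = start / SKETCH_PMD_SIZE;
	uint64_t last = (end + SKETCH_PMD_SIZE - 1) / SKETCH_PMD_SIZE;

	assert(start <= end && last <= SKETCH_NBITS);
	for (uint64_t bit = first; bit < last; bit++)
		if (sketch_bitmap & (1ULL << bit))
			return true;
	return false;
}

/* Accept every unaccepted 2MiB unit overlapping [start, end). */
static void sketch_accept_memory(uint64_t start, uint64_t end)
{
	uint64_t first = start / SKETCH_PMD_SIZE;
	uint64_t last = (end + SKETCH_PMD_SIZE - 1) / SKETCH_PMD_SIZE;

	assert(start <= end && last <= SKETCH_NBITS);
	for (uint64_t bit = first; bit < last; bit++) {
		if (!(sketch_bitmap & (1ULL << bit)))
			continue;	/* already accepted: never accept twice */
		sketch_accept_calls++;	/* platform-specific call would go here */
		sketch_bitmap &= ~(1ULL << bit);
	}
}
```

Clearing the bit right after the acceptance step is what makes a second
sketch_accept_memory() call over the same range a no-op.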

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page.h              |  5 +++
 arch/x86/include/asm/unaccepted_memory.h |  1 +
 arch/x86/mm/Makefile                     |  2 +
 arch/x86/mm/unaccepted_memory.c          | 53 ++++++++++++++++++++++++
 4 files changed, 61 insertions(+)
 create mode 100644 arch/x86/mm/unaccepted_memory.c

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 9cc82f305f4b..9ae0064f97e5 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,11 @@
 struct page;
 
 #include <linux/range.h>
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+#include <asm/unaccepted_memory.h>
+#endif
+
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index f1f835d3cd78..a8d12ef1bda8 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -10,5 +10,6 @@ struct boot_params;
 void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
 
 void accept_memory(phys_addr_t start, phys_addr_t end);
+bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end);
 
 #endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index fe3d3061fc11..e327f83e6bbf 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -60,3 +60,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
+obj-$(CONFIG_UNACCEPTED_MEMORY)	+= unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
new file mode 100644
index 000000000000..3588a7cb954c
--- /dev/null
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/spinlock.h>
+
+#include <asm/io.h>
+#include <asm/setup.h>
+#include <asm/unaccepted_memory.h>
+
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *unaccepted_memory;
+	unsigned long flags;
+	unsigned int rs, re;
+
+	if (!boot_params.unaccepted_memory)
+		return;
+
+	unaccepted_memory = __va(boot_params.unaccepted_memory);
+	rs = start / PMD_SIZE;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for_each_set_bitrange_from(rs, re, unaccepted_memory,
+				   DIV_ROUND_UP(end, PMD_SIZE)) {
+		/* Platform-specific memory-acceptance call goes here */
+		panic("Cannot accept memory");
+		bitmap_clear(unaccepted_memory, rs, re - rs);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *unaccepted_memory = __va(boot_params.unaccepted_memory);
+	unsigned long flags;
+	bool ret = false;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	while (start < end) {
+		if (test_bit(start / PMD_SIZE, unaccepted_memory)) {
+			ret = true;
+			break;
+		}
+
+		start += PMD_SIZE;
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return ret;
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCHv4 7/8] x86/tdx: Unaccepted memory support
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2022-04-05 23:43 ` [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-08 18:28   ` Dave Hansen
  2022-04-05 23:43 ` [PATCHv4 8/8] mm/vmstat: Add counter for memory accepting Kirill A. Shutemov
  2022-04-08 17:02 ` [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Dave Hansen
  8 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

All preparations are complete. Hook up TDX-specific code to accept memory.

There are two tdx_accept_memory() implementations: one in the main kernel
and one in the decompressor.

The implementation in the core kernel uses tdx_enc_status_changed().
The helper is not available in the decompressor, so a self-contained
implementation is added there instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                             |  1 +
 arch/x86/boot/compressed/tdx.c               | 41 ++++++++++++++++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 23 ++++++++++-
 arch/x86/coco/tdx/tdx.c                      | 29 +++++++++-----
 arch/x86/include/asm/shared/tdx.h            | 20 ++++++++++
 arch/x86/include/asm/tdx.h                   | 19 ---------
 arch/x86/mm/unaccepted_memory.c              |  6 ++-
 7 files changed, 109 insertions(+), 30 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7021ec725dd3..e4c31dbea6d7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -885,6 +885,7 @@ config INTEL_TDX_GUEST
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
 	select X86_MCE
+	select UNACCEPTED_MEMORY
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 918a7606f53c..a0bd1426d235 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -9,6 +9,7 @@
 #include <uapi/asm/vmx.h>
 
 #include <asm/shared/tdx.h>
+#include <asm/page_types.h>
 
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void)
@@ -75,3 +76,43 @@ void early_tdx_detect(void)
 	pio_ops.f_outb = tdx_outb;
 	pio_ops.f_outw = tdx_outw;
 }
+
+#define TDACCEPTPAGE		6
+#define TDVMCALL_MAP_GPA	0x10001
+
+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = fn,
+		.r12 = r12,
+		.r13 = r13,
+		.r14 = r14,
+		.r15 = r15,
+	};
+
+	return __tdx_hypercall(&args, 0);
+}
+
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	int i;
+
+	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+		error("Cannot accept memory: MapGPA failed\n");
+
+	/*
+	 * For shared->private conversion, accept the page using TDACCEPTPAGE
+	 * TDX module call.
+	 */
+	for (i = 0; i < (end - start) / PAGE_SIZE; i++) {
+		if (__tdx_module_call(TDACCEPTPAGE, start + i * PAGE_SIZE,
+				      0, 0, 0, NULL)) {
+			error("Cannot accept memory: page accept failed\n");
+		}
+	}
+}
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index 3ebab63789bb..662ec32e3c42 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -1,12 +1,33 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
+#include <asm/shared/tdx.h>
 #include "error.h"
 #include "misc.h"
+#include "tdx.h"
+
+static bool is_tdx_guest(void)
+{
+	static bool once;
+	static bool is_tdx;
+
+	if (!once) {
+		u32 eax, sig[3];
+
+		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
+			    &sig[0], &sig[2],  &sig[1]);
+		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
+	}
+
+	return is_tdx;
+}
 
 static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
-	error("Cannot accept memory");
+	if (is_tdx_guest())
+		tdx_accept_memory(start, end);
+	else
+		error("Cannot accept memory");
 }
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 03deb4d6920d..0fcb54e07797 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -606,16 +606,8 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
 	return true;
 }
 
-/*
- * Inform the VMM of the guest's intent for this physical page: shared with
- * the VMM or private to the guest.  The VMM is expected to change its mapping
- * of the page in response.
- */
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool __tdx_enc_status_changed(phys_addr_t start, phys_addr_t end, bool enc)
 {
-	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
-
 	if (!enc) {
 		/* Set the shared (decrypted) bits: */
 		start |= cc_mkdec(0);
@@ -660,6 +652,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	return true;
 }
 
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	if (!__tdx_enc_status_changed(start, end, true))
+		panic("Accepting memory failed\n");
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest.  The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+	phys_addr_t start = __pa(vaddr);
+	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+	return __tdx_enc_status_changed(start, end, enc);
+}
+
 void __init tdx_early_init(void)
 {
 	u64 cc_mask;
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index e53f26228fbb..4eca6492b108 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -36,5 +36,25 @@ u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void);
 
+/*
+ * Used in __tdx_module_call() to gather the output registers' values of the
+ * TDCALL instruction when requesting services from the TDX module. This is a
+ * software only structure and not part of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..d9106d3e89f8 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,21 +20,6 @@
 
 #ifndef __ASSEMBLY__
 
-/*
- * Used to gather the output registers values of the TDCALL and SEAMCALL
- * instructions when requesting services from the TDX module.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_module_output {
-	u64 rcx;
-	u64 rdx;
-	u64 r8;
-	u64 r9;
-	u64 r10;
-	u64 r11;
-};
-
 /*
  * Used by the #VE exception handler to gather the #VE exception
  * info from the TDX module. This is a software only structure
@@ -55,10 +40,6 @@ struct ve_info {
 
 void __init tdx_early_init(void);
 
-/* Used to communicate with the TDX module */
-u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-		      struct tdx_module_output *out);
-
 void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 3588a7cb954c..2f1c3c0375cd 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -6,6 +6,7 @@
 
 #include <asm/io.h>
 #include <asm/setup.h>
+#include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
 
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -26,7 +27,10 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	for_each_set_bitrange_from(rs, re, unaccepted_memory,
 				   DIV_ROUND_UP(end, PMD_SIZE)) {
 		/* Platform-specific memory-acceptance call goes here */
-		panic("Cannot accept memory");
+		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+			tdx_accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
+		else
+			panic("Cannot accept memory");
 		bitmap_clear(unaccepted_memory, rs, re - rs);
 	}
 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCHv4 8/8] mm/vmstat: Add counter for memory accepting
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2022-04-05 23:43 ` [PATCHv4 7/8] x86/tdx: Unaccepted memory support Kirill A. Shutemov
@ 2022-04-05 23:43 ` Kirill A. Shutemov
  2022-04-12  8:18   ` David Hildenbrand
  2022-04-08 17:02 ` [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Dave Hansen
  8 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-05 23:43 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, David Hildenbrand,
	x86, linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

The counter is increased every time the kernel accepts a memory region.

The counter makes it possible to see whether memory acceptance is still
ongoing and contributing to memory allocation latency.
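
One way to watch the counter from userspace is to look up its line in
/proc/vmstat text. The sketch below assumes the field is exported as
"accept_memory", as in the vmstat_text hunk of this patch, and that the
caller has already read the file into a buffer; sketch_vmstat_lookup()
is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a "name value" line in vmstat-style text.
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
static int sketch_vmstat_lookup(const char *text, const char *name,
				unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	assert(text && name && value);
	while (*line) {
		/* Match the whole field name, followed by a space */
		if (!strncmp(line, name, len) && line[len] == ' ')
			return sscanf(line + len, "%llu", value) == 1 ? 0 : -1;

		/* Advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return -1;
}
```

Since the patch adds PMD_SIZE / PAGE_SIZE per event, the value is in
units of base pages; sampling it twice and comparing shows whether
acceptance is still in progress.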

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/unaccepted_memory.c | 1 +
 include/linux/vm_event_item.h   | 3 +++
 mm/vmstat.c                     | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 2f1c3c0375cd..7cfe0bd8d2be 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -32,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		else
 			panic("Cannot accept memory");
 		bitmap_clear(unaccepted_memory, rs, re - rs);
+		count_vm_events(ACCEPT_MEMORY, PMD_SIZE / PAGE_SIZE);
 	}
 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
 }
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 16a0a4fd000b..6a468164a2f9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -136,6 +136,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
 		DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+		ACCEPT_MEMORY,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b75b1a64b54c..4c9197f32406 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1397,6 +1397,9 @@ const char * const vmstat_text[] = {
 	"direct_map_level2_splits",
 	"direct_map_level3_splits",
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	"accept_memory",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory
  2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2022-04-05 23:43 ` [PATCHv4 8/8] mm/vmstat: Add counter for memory accepting Kirill A. Shutemov
@ 2022-04-08 17:02 ` Dave Hansen
  2022-04-09 23:44   ` Kirill A. Shutemov
  8 siblings, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 17:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> Patches 1-6/7 are generic and don't have any dependencies on TDX. They
> should serve AMD SEV needs as well. TDX-specific code isolated in the
> last patch.

Oh, that's quite nice.  Are the SEV-SNP folks planning on using this?
If they are, acks/reviews would be much appreciated.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-04-08 17:26   ` Dave Hansen
  2022-04-09 19:41     ` Kirill A. Shutemov
  2022-04-14 15:55     ` Borislav Petkov
  2022-04-15 22:24   ` Borislav Petkov
  1 sibling, 2 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 17:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> +void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> +{
> +	/*
> +	 * The accepted memory bitmap only works at PMD_SIZE granularity.
> +	 * If a request comes in to mark memory as unaccepted which is not
> +	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
> +	 * *marked* as unaccepted.
> +	 */
> +
> +	/*
> +	 * Accept small regions that might not be able to be represented
> +	 * in the bitmap:
> +	 */
> +	if (end - start < 2 * PMD_SIZE) {
> +		__accept_memory(start, end);
> +		return;
> +	}

This is not my first time looking at this code and I still had to think
about this a bit.  That's not good.  The pathological case here is
actually something like this:

| 4k | 2044k + 2044k | 4k |
^ 0x0 	     ^ 2MB	  ^ 4MB

Where we have a 2MB-aligned 4k accepted area, a 4088k unaccepted area,
then another 4k accepted area.  That will not result in any bits being
set in the accepted memory bitmap because no 2MB region is fully
unaccepted.

The one oddball case is this:

| 4k | 2044k |    2048k   |
^ 0x0 	     ^ 2MB	  ^ 4MB

Which would fall into the if() above, but *can* have part of its range
marked in the bitmap.

Maybe we need something more like this:

	/*
	 * Accept small regions that might not be able to be represented
	 * in the bitmap.  This is a bit imprecise and may accept some
	 * areas that could have been represented in the bitmap instead.
	 * But, the imprecision makes the code simpler by ensuring that
	 * at least one bit will be set in the bitmap below.
	 */

Although that seems a _bit_ verbose, and somewhat duplicates the
comment below.
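
The two layouts above can be compared with a small userspace model of
the quoted logic. This is only a sketch: it counts 2MiB units instead of
touching a bitmap, and the sketch_* names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE	(2ULL << 20)
#define SKETCH_PMD_MASK	(~(SKETCH_PMD_SIZE - 1))

/* Whole 2MiB units inside [start, end) that a bitmap could represent. */
static uint64_t sketch_representable_units(uint64_t start, uint64_t end)
{
	uint64_t first = (start + SKETCH_PMD_SIZE - 1) & SKETCH_PMD_MASK;
	uint64_t last = end & SKETCH_PMD_MASK;

	return last > first ? (last - first) / SKETCH_PMD_SIZE : 0;
}

/* Units that the quoted mark_unaccepted() logic would record. */
static uint64_t sketch_marked_units(uint64_t start, uint64_t end)
{
	assert(start <= end);

	/* Small ranges are accepted eagerly and leave no bits set */
	if (end - start < 2 * SKETCH_PMD_SIZE)
		return 0;

	/* Unaligned head and tail are accepted; the middle is marked */
	return sketch_representable_units(start, end);
}
```

For the first layout (4k to 4MB - 4k) both functions return 0. For the
oddball layout (4k to 4MB) sketch_representable_units() returns 1 while
sketch_marked_units() returns 0, which is the imprecision described
above.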

> +	/*
> +	 * No matter how the start and end are aligned, at least one unaccepted
> +	 * PMD_SIZE area will remain.
> +	 */
> +
> +	/* Immediately accept a <PMD_SIZE piece at the start: */
> +	if (start & ~PMD_MASK) {
> +		__accept_memory(start, round_up(start, PMD_SIZE));
> +		start = round_up(start, PMD_SIZE);
> +	}
> +
> +	/* Immediately accept a <PMD_SIZE piece at the end: */
> +	if (end & ~PMD_MASK) {
> +		__accept_memory(round_down(end, PMD_SIZE), end);
> +		end = round_down(end, PMD_SIZE);
> +	}
> +
> +	/*
> +	 * 'start' and 'end' are now both PMD-aligned.
> +	 * Record the range as being unaccepted:
> +	 */
> +	bitmap_set((unsigned long *)params->unaccepted_memory,
> +		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> +}
> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> new file mode 100644
> index 000000000000..cbc24040b853
> --- /dev/null
> +++ b/arch/x86/include/asm/unaccepted_memory.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2020 Intel Corporation */
> +#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
> +#define _ASM_X86_UNACCEPTED_MEMORY_H
> +
> +#include <linux/types.h>
> +
> +struct boot_params;
> +
> +void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
> +
> +#endif
> diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> index b25d3f82c2f3..16bc686a198d 100644
> --- a/arch/x86/include/uapi/asm/bootparam.h
> +++ b/arch/x86/include/uapi/asm/bootparam.h
> @@ -217,7 +217,8 @@ struct boot_params {
>  	struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */
>  	__u8  _pad8[48];				/* 0xcd0 */
>  	struct edd_info eddbuf[EDDMAXNR];		/* 0xd00 */
> -	__u8  _pad9[276];				/* 0xeec */
> +	__u64 unaccepted_memory;			/* 0xeec */
> +	__u8  _pad9[268];				/* 0xef4 */
>  } __attribute__((packed));
>  
>  /**
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 2c3dac5ecb36..b17ceec757d0 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -243,6 +243,21 @@ config EFI_DISABLE_PCI_DMA
>  	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
>  	  may be used to override this option.
>  
> +config UNACCEPTED_MEMORY
> +	bool
> +	depends on EFI_STUB
> +	depends on !KEXEC_CORE

The changelog should probably say something about how the kexec()
incompatibility is going to be rectified in the future.

> +	help
> +	   Some Virtual Machine platforms, such as Intel TDX, require
> +	   some memory to be "accepted" by the guest before it can be used.
> +	   This mechanism helps prevent malicious hosts from making changes
> +	   to guest memory.
> +
> +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> +	   This option adds support for unaccepted memory and makes such memory
> +	   usable by kernel.
> +
>  endmenu

BTW, what happens if this is compiled out?  Do TDX guests just lose all
the unaccepted memory?

Should TDX be selecting this or something?

>  config EFI_EMBEDDED_FIRMWARE
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 5502e176d51b..2c055afb1b11 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -747,6 +747,7 @@ static __initdata char memory_type_name[][13] = {
>  	"MMIO Port",
>  	"PAL Code",
>  	"Persistent",
> +	"Unaccepted",
>  };
>  
>  char * __init efi_md_typeattr_format(char *buf, size_t size,
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index d18cac8ab436..e7601fd612aa 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,12 +9,14 @@
>  #include <linux/efi.h>
>  #include <linux/pci.h>
>  #include <linux/stddef.h>
> +#include <linux/bitmap.h>
>  
>  #include <asm/efi.h>
>  #include <asm/e820/types.h>
>  #include <asm/setup.h>
>  #include <asm/desc.h>
>  #include <asm/boot.h>
> +#include <asm/unaccepted_memory.h>
>  
>  #include "efistub.h"
>  
> @@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>  			e820_type = E820_TYPE_PMEM;
>  			break;
>  
> +		case EFI_UNACCEPTED_MEMORY:
> +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +				continue;

This seems worthy of a pr_info().  We're effectively throwing away
memory with this "continue", right?

> +			e820_type = E820_TYPE_RAM;
> +			mark_unaccepted(params, d->phys_addr,
> +					d->phys_addr + PAGE_SIZE * d->num_pages);
> +			break;
>  		default:
>  			continue;
>  		}
> @@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
>  {
>  	efi_status_t status;
>  	__u32 nr_desc;
> +	bool unaccepted_memory_present = false;
> +	u64 max_addr = 0;
> +	int i;
>  
>  	status = efi_get_memory_map(map);
>  	if (status != EFI_SUCCESS)
> @@ -589,9 +601,57 @@ static efi_status_t allocate_e820(struct boot_params *params,
>  		if (status != EFI_SUCCESS)
>  			goto out;
>  	}
> +
> +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +		goto out;
> +
> +	/* Check if there's any unaccepted memory and find the max address */
> +	for (i = 0; i < nr_desc; i++) {
> +		efi_memory_desc_t *d;
> +
> +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> +		if (d->type == EFI_UNACCEPTED_MEMORY)
> +			unaccepted_memory_present = true;
> +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> +	}
> +
> +	/*
> +	 * If unaccepted memory is present allocate a bitmap to track what
> +	 * memory has to be accepted before access.
> +	 *
> +	 * One bit in the bitmap represents 2MiB in the address space:
> +	 * A 4k bitmap can track 64GiB of physical address space.
> +	 *
> +	 * In the worst case scenario -- a huge hole in the middle of the
> +	 * address space -- It needs 256MiB to handle 4PiB of the address
> +	 * space.
> +	 *
> +	 * TODO: handle situation if params->unaccepted_memory has already set.
> +	 * It's required to deal with kexec.

Ahhh....  That's good info for the changelog.

> +	 * The bitmap will be populated in setup_e820() according to the memory
> +	 * map after efi_exit_boot_services().
> +	 */
> +	if (unaccepted_memory_present) {
> +		unsigned long *unaccepted_memory = NULL;
> +		u64 size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> +
> +		status = efi_allocate_pages(size,
> +					    (unsigned long *)&unaccepted_memory,
> +					    ULONG_MAX);
> +		if (status != EFI_SUCCESS)
> +			goto out;
> +		memset(unaccepted_memory, 0, size);
> +		params->unaccepted_memory = (unsigned long)unaccepted_memory;
> +	} else {
> +		params->unaccepted_memory = 0;
> +	}
> +
>  out:
>  	efi_bs_call(free_pool, *map->map);
> -	return EFI_SUCCESS;
> +	return status;
> +
>  }
>  
>  struct exit_boot_struct {
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index ccd4d3f91c98..b0240fdcaf5b 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -108,7 +108,8 @@ typedef	struct {
>  #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
>  #define EFI_PAL_CODE			13
>  #define EFI_PERSISTENT_MEMORY		14
> -#define EFI_MAX_MEMORY_TYPE		15
> +#define EFI_UNACCEPTED_MEMORY		15
> +#define EFI_MAX_MEMORY_TYPE		16
>  
>  /* Attribute values: */
>  #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */



* Re: [PATCHv4 4/8] x86/boot/compressed: Handle unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 4/8] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2022-04-08 17:57   ` Dave Hansen
  2022-04-09 20:20     ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 17:57 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> Firmware is responsible for accepting memory where compressed kernel
> image and initrd land. But kernel has to accept memory for decompression
> buffer: accept memory just before decompression starts.

I think I'd appreciate a sentence or two more about what's going on.
How about something like this?

The firmware starts the kernel by booting into the "compressed" kernel
stub.  That stub's job is to decompress the full kernel image and then
jump to it.

The firmware will pre-accept the memory used to run the stub.  But, the
stub is responsible for accepting the memory into which it decompresses
the main kernel.  Accept memory just before decompression starts.

The stub is also responsible for choosing a physical address in which to
place the decompressed kernel image.  The KASLR mechanism will randomize
this physical address.  Since the pre-accepted memory region is relatively
small, KASLR would be quite ineffective if it only used the pre-accepted
area (EFI_CONVENTIONAL_MEMORY).  Ensure that KASLR randomizes among the
entire physical address space by also including EFI_UNACCEPTED_MEMORY.


> diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
> index bf58b259380a..ba2de61c0823 100644
> --- a/arch/x86/boot/compressed/bitmap.c
> +++ b/arch/x86/boot/compressed/bitmap.c
> @@ -2,6 +2,48 @@
>  /* Taken from lib/string.c */
>  
>  #include <linux/bitmap.h>
> +#include <linux/math.h>
> +#include <linux/minmax.h>
> +
> +unsigned long _find_next_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long nbits,
> +		unsigned long start, unsigned long invert, unsigned long le)
> +{
> +	unsigned long tmp, mask;
> +
> +	if (unlikely(start >= nbits))
> +		return nbits;
> +
> +	tmp = addr1[start / BITS_PER_LONG];
> +	if (addr2)
> +		tmp &= addr2[start / BITS_PER_LONG];
> +	tmp ^= invert;
> +
> +	/* Handle 1st word. */
> +	mask = BITMAP_FIRST_WORD_MASK(start);
> +	if (le)
> +		mask = swab(mask);
> +
> +	tmp &= mask;
> +
> +	start = round_down(start, BITS_PER_LONG);
> +
> +	while (!tmp) {
> +		start += BITS_PER_LONG;
> +		if (start >= nbits)
> +			return nbits;
> +
> +		tmp = addr1[start / BITS_PER_LONG];
> +		if (addr2)
> +			tmp &= addr2[start / BITS_PER_LONG];
> +		tmp ^= invert;
> +	}
> +
> +	if (le)
> +		tmp = swab(tmp);
> +
> +	return min(start + __ffs(tmp), nbits);
> +}
>  
>  void __bitmap_set(unsigned long *map, unsigned int start, int len)
>  {
> @@ -22,3 +64,23 @@ void __bitmap_set(unsigned long *map, unsigned int start, int len)
>  		*p |= mask_to_set;
>  	}
>  }
> +
> +void __bitmap_clear(unsigned long *map, unsigned int start, int len)
> +{
> +	unsigned long *p = map + BIT_WORD(start);
> +	const unsigned int size = start + len;
> +	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
> +	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
> +
> +	while (len - bits_to_clear >= 0) {
> +		*p &= ~mask_to_clear;
> +		len -= bits_to_clear;
> +		bits_to_clear = BITS_PER_LONG;
> +		mask_to_clear = ~0UL;
> +		p++;
> +	}
> +	if (len) {
> +		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
> +		*p &= ~mask_to_clear;
> +	}
> +}

It's a real shame that we have to duplicate this code.  Is there
anything crazy we could do here like

#include "../../../lib/find_bit.c"

?

> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 411b268bc0a2..59db90626042 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -725,10 +725,20 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
>  		 * but in practice there's firmware where using that memory leads
>  		 * to crashes.
>  		 *
> -		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
> +		 * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if
> +		 * supported) are guaranteed to be free.
>  		 */
> -		if (md->type != EFI_CONVENTIONAL_MEMORY)
> +
> +		switch (md->type) {
> +		case EFI_CONVENTIONAL_MEMORY:
> +			break;
> +		case EFI_UNACCEPTED_MEMORY:
> +			if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +				break;
>  			continue;
> +		default:
> +			continue;
> +		}
>  
>  		if (efi_soft_reserve_enabled() &&
>  		    (md->attribute & EFI_MEMORY_SP))
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index fa8969fad011..c1d9d71a6615 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -18,6 +18,7 @@
>  #include "../string.h"
>  #include "../voffset.h"
>  #include <asm/bootparam_utils.h>
> +#include <asm/unaccepted_memory.h>
>  
>  /*
>   * WARNING!!
> @@ -43,6 +44,9 @@
>  void *memmove(void *dest, const void *src, size_t n);
>  #endif
>  
> +#undef __pa
> +#define __pa(x)	((unsigned long)(x))

Those #undef's always worry me.  Why is this one needed?

>  /*
>   * This is set up by the setup-routine at boot-time
>   */
> @@ -451,6 +455,13 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
>  #endif
>  
>  	debug_putstr("\nDecompressing Linux... ");
> +
> +	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
> +	    boot_params->unaccepted_memory) {
> +		debug_putstr("Accepting memory... ");
> +		accept_memory(__pa(output), __pa(output) + needed_size);
> +	}
> +
>  	__decompress(input_data, input_len, NULL, NULL, output, output_len,
>  			NULL, error);
>  	parse_elf(output);
> diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
> index d363acf59c08..3ebab63789bb 100644
> --- a/arch/x86/boot/compressed/unaccepted_memory.c
> +++ b/arch/x86/boot/compressed/unaccepted_memory.c
> @@ -51,3 +51,17 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
>  	bitmap_set((unsigned long *)params->unaccepted_memory,
>  		   start / PMD_SIZE, (end - start) / PMD_SIZE);
>  }
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned long *unaccepted_memory;
> +	unsigned int rs, re;
> +
> +	unaccepted_memory = (unsigned long *)boot_params->unaccepted_memory;
> +	rs = start / PMD_SIZE;

OK, so start is a physical address, PMD_SIZE is 2^21, and 'rs' is an
unsigned int.  That means 'rs' can, at most, represent a physical
address at 2^(21+32), or 2^53.  That's cutting it a *bit* close, don't
you think?

Could we please just give 'rs' and 're' real names and make them
'unsigned long's, please?  It will surely save at least one other person
from doing math.  The find_next_bit() functions seem to take ulongs anyway.
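
To spell that math out, a small userspace sketch (helper names are made
up, and it assumes a 32-bit 'unsigned int' as on x86-64):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define PMD_SHIFT 21	/* PMD_SIZE == 1 << 21 == 2MiB */

/* First physical address whose 2MiB index no longer fits in 'unsigned int' */
static uint64_t max_trackable_addr(void)
{
	return ((uint64_t)UINT_MAX + 1) << PMD_SHIFT;
}

/* Model of 'rs = start / PMD_SIZE' with 'rs' declared as unsigned int */
static unsigned int index_as_uint(uint64_t start)
{
	return (unsigned int)(start >> PMD_SHIFT);
}

/* The same computation with a 64-bit index */
static uint64_t index_as_u64(uint64_t start)
{
	return start >> PMD_SHIFT;
}
```

With a 32-bit index, a start address of exactly 2^53 silently wraps to
bit 0, while the 64-bit variant yields index 2^32 as expected.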

> +	for_each_set_bitrange_from(rs, re, unaccepted_memory,
> +				   DIV_ROUND_UP(end, PMD_SIZE)) {
> +		__accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
> +		bitmap_clear(unaccepted_memory, rs, re - rs);
> +	}
> +}

Could we please introduce some intermediate variable?  For instance:

	unsigned long bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);

That will make this all a lot easier to read.
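
Roughly what I have in mind, as a userspace model rather than a drop-in
replacement.  The open-coded loop stands in for
for_each_set_bitrange_from(), platform_accept() stands in for
__accept_memory(), and 'range_start'/'range_end' are just suggested
names:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define PMD_SIZE		(UINT64_C(1) << 21)
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

static int test_bit(const unsigned long *map, unsigned long bit)
{
	return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
}

static void clear_bit(unsigned long *map, unsigned long bit)
{
	map[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
}

/* Stand-in for the platform call: just counts the bytes accepted */
static uint64_t accepted_bytes;

static void platform_accept(uint64_t start, uint64_t end)
{
	accepted_bytes += end - start;
}

static void accept_memory(unsigned long *bitmap, uint64_t start, uint64_t end)
{
	unsigned long bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
	unsigned long range_start = start / PMD_SIZE;
	unsigned long range_end;

	while (range_start < bitmap_size) {
		if (!test_bit(bitmap, range_start)) {
			range_start++;
			continue;
		}

		/* Extend over the run of set bits, clearing as we go */
		range_end = range_start;
		while (range_end < bitmap_size && test_bit(bitmap, range_end))
			clear_bit(bitmap, range_end++);

		platform_accept(range_start * PMD_SIZE, range_end * PMD_SIZE);
		range_start = range_end;
	}
}
```

With bits 1, 2 and 4 set, accepting 0..16MiB results in two calls (one
for 2MiB..6MiB, one for 8MiB..10MiB) and an empty bitmap afterwards.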

> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> index cbc24040b853..f1f835d3cd78 100644
> --- a/arch/x86/include/asm/unaccepted_memory.h
> +++ b/arch/x86/include/asm/unaccepted_memory.h
> @@ -9,4 +9,6 @@ struct boot_params;
>  
>  void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
>  
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +
>  #endif



* Re: [PATCHv4 5/8] x86/mm: Reserve unaccepted memory bitmap
  2022-04-05 23:43 ` [PATCHv4 5/8] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
@ 2022-04-08 18:08   ` Dave Hansen
  2022-04-09 20:43     ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 18:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> A given page of memory can only be accepted once.  The kernel has a need
> to accept memory both in the early decompression stage and during normal
> runtime.
> 
> A bitmap used to communicate the acceptance state of each page between the
> decompression stage and normal runtime.  This eliminates the possibility of
> attempting to double-accept a page.

... which is fatal, right?  Could you include that and also the rationale
for why it is fatal?

> The bitmap is allocated in EFI stub, decompression stage updates the state
> of pages used for the kernel and initrd and hands the bitmap over to the
> main kernel image via boot_params.

This is really good info.  Could we maybe expand it a bit?

There are several steps in the bitmap's lifecycle:
1. Bitmap is allocated in the EFI stub
2. The kernel decompression code locates it, accepts some pages before
   decompression and marks them in the bitmap.
3. Decompression code points to the bitmap via a boot_param
4. Main kernel locates bitmap via the boot_param and uses it to guide
   runtime acceptance decisions.

> In the runtime kernel, reserve the bitmap's memory to ensure nothing
> overwrites it.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/kernel/e820.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index f267205f2d5a..22d1fe48dcba 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
>  	int i;
>  	u64 end;
>  
> +	/* Mark unaccepted memory bitmap reserved */
> +	if (boot_params.unaccepted_memory) {
> +		unsigned long size;
> +
> +		/* One bit per 2MB */
> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> +				    PMD_SIZE * BITS_PER_BYTE);
> +		memblock_reserve(boot_params.unaccepted_memory, size);
> +	}

One oddity here: The size is implied by the e820's contents.  Did you
mention somewhere that unaccepted memory is considered E820_TYPE_RAM?
It *has* to be in order for e820__end_of_ram_pfn() to work, right?
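
For reference, the sizing rule both sides have to agree on, as a
standalone sketch.  unaccepted_bitmap_bytes() is a made-up helper; the
stub derives 'max_addr' from the EFI memory map while this patch derives
it from e820__end_of_ram_pfn(), so the two only match if they see the
same end of RAM:

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE		(UINT64_C(1) << 21)	/* one bit covers 2MiB */
#define BITS_PER_BYTE		8
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Bytes of bitmap needed to cover physical addresses [0, max_addr) */
static uint64_t unaccepted_bitmap_bytes(uint64_t max_addr)
{
	return DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
}
```

This matches the numbers in the cover letter: 64GiB of address space
needs a 4KiB bitmap, and 4PiB needs 256MiB.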



* Re: [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
@ 2022-04-08 18:15   ` Dave Hansen
  2022-04-08 19:21   ` Dave Hansen
  1 sibling, 0 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 18:15 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> Core-mm requires few helpers to support unaccepted memory:
> 
>  - accept_memory() checks the range of addresses against the bitmap and
>    accept memory if needed.
> 
>  - memory_is_unaccepted() check if anything within the range requires
>    acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/include/asm/page.h              |  5 +++
>  arch/x86/include/asm/unaccepted_memory.h |  1 +
>  arch/x86/mm/Makefile                     |  2 +
>  arch/x86/mm/unaccepted_memory.c          | 53 ++++++++++++++++++++++++
>  4 files changed, 61 insertions(+)
>  create mode 100644 arch/x86/mm/unaccepted_memory.c
> 
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 9cc82f305f4b..9ae0064f97e5 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -19,6 +19,11 @@
>  struct page;
>  
>  #include <linux/range.h>
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +#include <asm/unaccepted_memory.h>
> +#endif

It's a lot nicer to just do the #ifdefs inside the header.  Is there a
specific reason to do it this way?
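
Something along these lines is what I mean.  This is a sketch that
builds in userspace, not a tested patch: the phys_addr_t typedef is a
stand-in for the kernel type, and whether the !CONFIG case wants inline
stubs or just no declarations at all is an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* userspace stand-in for the kernel type */

/* What asm/unaccepted_memory.h could look like with the #ifdef inside */
#ifdef CONFIG_UNACCEPTED_MEMORY
void accept_memory(phys_addr_t start, phys_addr_t end);
bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end);
#else
/* Assumed stubs: with the option off, nothing is ever unaccepted */
static inline void accept_memory(phys_addr_t start, phys_addr_t end)
{
	(void)start;
	(void)end;
}

static inline bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
{
	(void)start;
	(void)end;
	return false;
}
#endif
```

asm/page.h could then include the header unconditionally, and callers
would not need their own #ifdefs either.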

>  extern struct range pfn_mapped[];
>  extern int nr_pfn_mapped;
>  
> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> index f1f835d3cd78..a8d12ef1bda8 100644
> --- a/arch/x86/include/asm/unaccepted_memory.h
> +++ b/arch/x86/include/asm/unaccepted_memory.h
> @@ -10,5 +10,6 @@ struct boot_params;
>  void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
>  
>  void accept_memory(phys_addr_t start, phys_addr_t end);
> +bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end);
>  
>  #endif
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index fe3d3061fc11..e327f83e6bbf 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -60,3 +60,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
>  
>  obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
>  obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
> +
> +obj-$(CONFIG_UNACCEPTED_MEMORY)	+= unaccepted_memory.o
> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> new file mode 100644
> index 000000000000..3588a7cb954c
> --- /dev/null
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/memblock.h>
> +#include <linux/mm.h>
> +#include <linux/pfn.h>
> +#include <linux/spinlock.h>
> +
> +#include <asm/io.h>
> +#include <asm/setup.h>
> +#include <asm/unaccepted_memory.h>
> +
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);

We need some documentation on what the lock does, either here or in the
changelog.

> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned long *unaccepted_memory;
> +	unsigned long flags;
> +	unsigned int rs, re;
> +
> +	if (!boot_params.unaccepted_memory)
> +		return;
> +
> +	unaccepted_memory = __va(boot_params.unaccepted_memory);
> +	rs = start / PMD_SIZE;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	for_each_set_bitrange_from(rs, re, unaccepted_memory,
> +				   DIV_ROUND_UP(end, PMD_SIZE)) {
> +		/* Platform-specific memory-acceptance call goes here */
> +		panic("Cannot accept memory");
> +		bitmap_clear(unaccepted_memory, rs, re - rs);
> +	}
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}

That panic() is making me nervous.  Is this bisect-safe?  Is it safe
because there are no callers of this function yet?

> +bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned long *unaccepted_memory = __va(boot_params.unaccepted_memory);
> +	unsigned long flags;
> +	bool ret = false;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	while (start < end) {
> +		if (test_bit(start / PMD_SIZE, unaccepted_memory)) {
> +			ret = true;
> +			break;
> +		}
> +
> +		start += PMD_SIZE;
> +	}
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> +	return ret;
> +}



* Re: [PATCHv4 7/8] x86/tdx: Unaccepted memory support
  2022-04-05 23:43 ` [PATCHv4 7/8] x86/tdx: Unaccepted memory support Kirill A. Shutemov
@ 2022-04-08 18:28   ` Dave Hansen
  0 siblings, 0 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 18:28 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> All preparations are complete. Hookup TDX-specific code to accept memory.
> 
> There are two tdx_accept_memory() implementations: one in main kernel
> and one in the decompresser.
> 
> The implementation in core kernel uses tdx_enc_status_changed().
> The helper is not available in the decompresser, self-contained
> implementation added there instead.

Why isn't it available?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7021ec725dd3..e4c31dbea6d7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -885,6 +885,7 @@ config INTEL_TDX_GUEST
>  	select ARCH_HAS_CC_PLATFORM
>  	select X86_MEM_ENCRYPT
>  	select X86_MCE
> +	select UNACCEPTED_MEMORY
>  	help
>  	  Support running as a guest under Intel TDX.  Without this support,
>  	  the guest kernel can not boot or run under TDX.

Ahh, there we go.  Nice.

> diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
> index 918a7606f53c..a0bd1426d235 100644
> --- a/arch/x86/boot/compressed/tdx.c
> +++ b/arch/x86/boot/compressed/tdx.c
> @@ -9,6 +9,7 @@
>  #include <uapi/asm/vmx.h>
>  
>  #include <asm/shared/tdx.h>
> +#include <asm/page_types.h>
>  
>  /* Called from __tdx_hypercall() for unrecoverable failure */
>  void __tdx_hypercall_failed(void)
> @@ -75,3 +76,43 @@ void early_tdx_detect(void)
>  	pio_ops.f_outb = tdx_outb;
>  	pio_ops.f_outw = tdx_outw;
>  }
> +
> +#define TDACCEPTPAGE		6
> +#define TDVMCALL_MAP_GPA	0x10001

That seems like unnecessary duplication.  Can't a #define be trivially
shared?

> +/*
> + * Wrapper for standard use of __tdx_hypercall with no output aside from
> + * return code.
> + */
> +static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +	struct tdx_hypercall_args args = {
> +		.r10 = TDX_HYPERCALL_STANDARD,
> +		.r11 = fn,
> +		.r12 = r12,
> +		.r13 = r13,
> +		.r14 = r14,
> +		.r15 = r15,
> +	};
> +
> +	return __tdx_hypercall(&args, 0);
> +}

Ditto on the sharing.

> +void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	int i;
> +
> +	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
> +		error("Cannot accept memory: MapGPA failed\n");
> +
> +	/*
> +	 * For shared->private conversion, accept the page using TDACCEPTPAGE
> +	 * TDX module call.
> +	 */

What does the shared->private conversion have to do with this?  I feel
like I'm missing something.  Is unaccepted memory shared initially?

> +	for (i = 0; i < (end - start) / PAGE_SIZE; i++) {
> +		if (__tdx_module_call(TDACCEPTPAGE, start + i * PAGE_SIZE,
> +				      0, 0, 0, NULL)) {
> +			error("Cannot accept memory: page accept failed\n");
> +		}
> +	}
> +}
> diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
> index 3ebab63789bb..662ec32e3c42 100644
> --- a/arch/x86/boot/compressed/unaccepted_memory.c
> +++ b/arch/x86/boot/compressed/unaccepted_memory.c
> @@ -1,12 +1,33 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  
> +#include <asm/shared/tdx.h>
>  #include "error.h"
>  #include "misc.h"
> +#include "tdx.h"
> +
> +static bool is_tdx_guest(void)
> +{
> +	static bool once;
> +	static bool is_tdx;
> +
> +	if (!once) {
> +		u32 eax, sig[3];
> +
> +		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
> +			    &sig[0], &sig[2],  &sig[1]);
> +		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
> +	}
> +
> +	return is_tdx;
> +}

Do you need to set 'once'?
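
As quoted, 'once' is never written, so the CPUID leaf is re-read on
every call.  The result is still correct, but the caching does nothing.
A userspace model of the pattern with the flag actually set, where
fake_cpuid_sig() is a made-up stand-in for cpuid_count() and the
TDX_IDENT value is assumed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TDX_IDENT "IntelTDX    "	/* assumed 12-byte signature */

static int cpuid_calls;

/* Hypothetical stand-in for cpuid_count(): always reports a TDX guest */
static void fake_cpuid_sig(uint32_t sig[3])
{
	cpuid_calls++;
	memcpy(sig, TDX_IDENT, 3 * sizeof(sig[0]));
}

static bool is_tdx_guest(void)
{
	static bool once;
	static bool is_tdx;

	if (!once) {
		uint32_t sig[3];

		fake_cpuid_sig(sig);
		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
		once = true;	/* the assignment missing from the patch */
	}

	return is_tdx;
}
```

Calling is_tdx_guest() repeatedly now queries the (fake) CPUID leaf
only on the first call.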

>  static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
>  {
>  	/* Platform-specific memory-acceptance call goes here */
> -	error("Cannot accept memory");
> +	if (is_tdx_guest())
> +		tdx_accept_memory(start, end);
> +	else
> +		error("Cannot accept memory");
>  }

Should those is_tdx_guest() checks be a new CC_ attribute, say:

	cc_guest_has(CC_UNACCEPTED_MEMORY)

and then have a:

	cc_accept_memory(start, end);

?

Or is that overkill when we only have TDX?

>  void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 03deb4d6920d..0fcb54e07797 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -606,16 +606,8 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
>  	return true;
>  }
>  
> -/*
> - * Inform the VMM of the guest's intent for this physical page: shared with
> - * the VMM or private to the guest.  The VMM is expected to change its mapping
> - * of the page in response.
> - */
> -static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> +static bool __tdx_enc_status_changed(phys_addr_t start, phys_addr_t end, bool enc)
>  {
> -	phys_addr_t start = __pa(vaddr);
> -	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> -
>  	if (!enc) {
>  		/* Set the shared (decrypted) bits: */
>  		start |= cc_mkdec(0);
> @@ -660,6 +652,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>  	return true;
>  }
>  
> +void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	if (!__tdx_enc_status_changed(start, end, true))
> +		panic("Accepting memory failed\n");
> +}

Could we make a bit better use of our naming bytes?  "__" tells us
nothing.  What if we had:

	tdx_enc_status_changed_phys()

we could call the other one _virt() or leave it alone too.

> +/*
> + * Inform the VMM of the guest's intent for this physical page: shared with
> + * the VMM or private to the guest.  The VMM is expected to change its mapping
> + * of the page in response.
> + */
> +static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> +{
> +	phys_addr_t start = __pa(vaddr);
> +	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
> +
> +	return __tdx_enc_status_changed(start, end, enc);
> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	u64 cc_mask;
> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> index e53f26228fbb..4eca6492b108 100644
> --- a/arch/x86/include/asm/shared/tdx.h
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -36,5 +36,25 @@ u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
>  /* Called from __tdx_hypercall() for unrecoverable failure */
>  void __tdx_hypercall_failed(void);
>  
> +/*
> + * Used in __tdx_module_call() to gather the output registers' values of the
> + * TDCALL instruction when requesting services from the TDX module. This is a
> + * software only structure and not part of the TDX module/VMM ABI
> + */
> +struct tdx_module_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};

Is this the first module call with an output?

> +/* Used to communicate with the TDX module */
> +u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +		      struct tdx_module_output *out);
> +
> +void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
> +
>  #endif /* !__ASSEMBLY__ */
>  #endif /* _ASM_X86_SHARED_TDX_H */
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 020c81a7c729..d9106d3e89f8 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -20,21 +20,6 @@
>  
>  #ifndef __ASSEMBLY__
>  
> -/*
> - * Used to gather the output registers values of the TDCALL and SEAMCALL
> - * instructions when requesting services from the TDX module.
> - *
> - * This is a software only structure and not part of the TDX module/VMM ABI.
> - */
> -struct tdx_module_output {
> -	u64 rcx;
> -	u64 rdx;
> -	u64 r8;
> -	u64 r9;
> -	u64 r10;
> -	u64 r11;
> -};
> -
>  /*
>   * Used by the #VE exception handler to gather the #VE exception
>   * info from the TDX module. This is a software only structure
> @@ -55,10 +40,6 @@ struct ve_info {
>  
>  void __init tdx_early_init(void);
>  
> -/* Used to communicate with the TDX module */
> -u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> -		      struct tdx_module_output *out);
> -
>  void tdx_get_ve_info(struct ve_info *ve);

There's enough TDX helper munging going on here that I'd probably break
this out into a separate patch.

> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> index 3588a7cb954c..2f1c3c0375cd 100644
> --- a/arch/x86/mm/unaccepted_memory.c
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -6,6 +6,7 @@
>  
>  #include <asm/io.h>
>  #include <asm/setup.h>
> +#include <asm/shared/tdx.h>
>  #include <asm/unaccepted_memory.h>
>  
>  static DEFINE_SPINLOCK(unaccepted_memory_lock);
> @@ -26,7 +27,10 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  	for_each_set_bitrange_from(rs, re, unaccepted_memory,
>  				   DIV_ROUND_UP(end, PMD_SIZE)) {
>  		/* Platform-specific memory-acceptance call goes here */
> -		panic("Cannot accept memory");
> +		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> +			tdx_accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
> +		else
> +			panic("Cannot accept memory");
>  		bitmap_clear(unaccepted_memory, rs, re - rs);
>  	}
>  	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);



* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 1/8] mm: Add " Kirill A. Shutemov
@ 2022-04-08 18:55   ` Dave Hansen
  2022-04-09 15:54     ` Kirill A. Shutemov
  2022-04-08 19:04   ` David Hildenbrand
  2022-04-08 19:11   ` Dave Hansen
  2 siblings, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 18:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, requiring memory to be accepted before it can be used by the

		^ require

> guest. Accepting happens via a protocol specific for the Virtual Machine
> platform.

							^ s/for/to

> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
> 
> Support of such memory requires a few changes in core-mm code:
> 
>   - memblock has to accept memory on allocation;
> 
>   - page allocator has to accept memory on the first allocation of the
>     page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages on the first allocation.
> PageUnaccepted() is used to indicate that the page requires acceptance.

Does this consume an actual page flag or is it aliased?

> Kernel only needs to accept memory once after boot, so during the boot
> and warm up phase there will be a lot of memory acceptance. After things
> are settled down the only price of the feature if couple of checks for
> PageUnaccepted() in allocate and free paths. The check refers a hot

							       ^ to

...
> + /*
> +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
> +  * be used. Page allocator has to call accept_page() before returning the page
> +  * to the caller.
> +  */

Let's talk about "used" with a bit more detail.  Maybe:

/*
 * PageUnaccepted() indicates that the page has to be "accepted" before
 * it can be read or written.  The page allocator must to call
 * accept_page() before touching the page or returning it to the caller.
 */

...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2db95780e003..53f4aa1c92a7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
>   */
>  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
>  
> +/*
> + * Check if the page needs to be marked as PageUnaccepted().
> + * Used for the new pages added to the buddy allocator for the first time.
> + */
> +#define FPI_UNACCEPTED		((__force fpi_t)BIT(3))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>  	return page_is_buddy(higher_page, higher_buddy, order + 1);
>  }
>  
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	int i;
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		if (PageUnaccepted(page + i))
> +			__ClearPageUnaccepted(page + i);
> +	}
> +}

It's probably worth a comment somewhere that this can be really slow.

> +static bool page_is_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +
> +	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long combined_pfn;
>  	struct page *buddy;
>  	bool to_tail;
> +	bool unaccepted = PageUnaccepted(page);
>  
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
>  			clear_page_guard(zone, buddy, order, migratetype);
>  		else
>  			del_page_from_free_list(buddy, zone, order);
> +
> +		/* Mark page unaccepted if any of merged pages were unaccepted */
> +		if (PageUnaccepted(buddy))
> +			unaccepted = true;

Naming nit: following the logic with a double-negative like !unaccepted
is a bit hard.  Would this be more readable if it were:

	bool page_needs_acceptance = PageUnaccepted(page);

and then the code below...

>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
> @@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>  	set_buddy_order(page, order);
>  
> +	/*
> +	 * Check if the page needs to be marked as PageUnaccepted().
> +	 * Used for the new pages added to the buddy allocator for the first
> +	 * time.
> +	 */
> +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
> +		unaccepted = page_is_unaccepted(page, order);

	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
		page_needs_acceptance = page_is_unaccepted(page, order);

> +	if (unaccepted)
> +		__SetPageUnaccepted(page);

This is getting hard for me to follow.

There are:
1. Pages that come in here with PageUnaccepted()==1
2. Pages that come in here with PageUnaccepted()==0, but a buddy that
   was PageUnaccepted()==1

In either of those cases, the bitmap will be consulted to see if the
page is *truly* unaccepted or not.  But, I'm struggling to figure out
how a page could end up in one of those scenarios and *not* be
page_is_unaccepted().

There are three pieces of information that come in:
1. PageUnaccepted(page)
2. PageUnaccepted(buddies[])
3. the bitmap

and one piece of information going out:

PageUnaccepted(page);

I think I need a more coherent description of how those four things fit
together.

>  	if (fpi_flags & FPI_TO_TAIL)
>  		to_tail = true;
>  	else if (is_shuffle_order(order))
> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>  static inline bool page_expected_state(struct page *page,
>  					unsigned long check_flags)
>  {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> +	    !PageUnaccepted(page))
>  		return false;

That probably deserves a comment, and maybe its own if() statement.

>  	if (unlikely((unsigned long)page->mapping |
> @@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
>  	 */
> -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +	__free_pages_ok(page, order,
> +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
>  }
>  
>  #ifdef CONFIG_NUMA
> @@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>  
> +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);

Comment, please.  I assume doing the slow accept up front is OK here
because this is in the deferred path.  But, it would be nice to know for
sure.

> @@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
>  		if (set_page_guard(zone, &page[size], high, migratetype))
>  			continue;
>  
> +		/* Transfer PageUnaccepted() to the newly split pages */
> +		if (PageUnaccepted(page))
> +			__SetPageUnaccepted(&page[size]);

We don't want to just accept the page here, right?  Because we're
holding the zone lock?  Maybe we should mention that:

		/*
		 * Transfer PageUnaccepted() to the newly split pages so
		 * they can be accepted after dropping the zone lock.
		 */

>  		add_to_free_list(&page[size], zone, high, migratetype);
>  		set_buddy_order(&page[size], high);
>  	}
> @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 */
>  	kernel_unpoison_pages(page, 1 << order);
>  
> +	if (PageUnaccepted(page))
> +		accept_page(page, order);
> +
>  	/*
>  	 * As memory initialization might be integrated into KASAN,
>  	 * KASAN unpoisoning and memory initializion code must be

Is accepted memory guaranteed to be zeroed?  Do we want to skip the
__GFP_ZERO behavior later in this function?  Or is that just a silly
over-optimization?

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 1/8] mm: Add " Kirill A. Shutemov
  2022-04-08 18:55   ` Dave Hansen
@ 2022-04-08 19:04   ` David Hildenbrand
  2022-04-08 19:11   ` Dave Hansen
  2 siblings, 0 replies; 67+ messages in thread
From: David Hildenbrand @ 2022-04-08 19:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 06.04.22 01:43, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, requiring memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific for the Virtual Machine
> platform.
> 
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
> 
> Support of such memory requires a few changes in core-mm code:
> 
>   - memblock has to accept memory on allocation;
> 
>   - page allocator has to accept memory on the first allocation of the
>     page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages on the first allocation.
> PageUnaccepted() is used to indicate that the page requires acceptance.
> 
> Kernel only needs to accept memory once after boot, so during the boot
> and warm up phase there will be a lot of memory acceptance. After things
> are settled down the only price of the feature if couple of checks for
> PageUnaccepted() in allocate and free paths. The check refers a hot
> variable (that also encodes PageBuddy()), so it is cheap and not visible
> on profiles.
> 
> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>  - accept_memory() makes a range of physical addresses accepted.
> 
>  - memory_is_unaccepted() checks anything within the range of physical
>    addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock

Thanks, I skimmed over most parts and nothing obvious jumped out at me. Dave
has some good suggestions; I'll try giving it a thorough review in the near
future.


-- 
Thanks,

David / dhildenb


* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 1/8] mm: Add " Kirill A. Shutemov
  2022-04-08 18:55   ` Dave Hansen
  2022-04-08 19:04   ` David Hildenbrand
@ 2022-04-08 19:11   ` Dave Hansen
  2022-04-09 17:52     ` Kirill A. Shutemov
  2022-04-12  8:15     ` David Hildenbrand
  2 siblings, 2 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 19:11 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> Kernel only needs to accept memory once after boot, so during the boot
> and warm up phase there will be a lot of memory acceptance. After things
> are settled down the only price of the feature if couple of checks for
> PageUnaccepted() in allocate and free paths. The check refers a hot
> variable (that also encodes PageBuddy()), so it is cheap and not visible
> on profiles.

Let's also not sugar-coat this.  Page acceptance is hideously slow.
It's agonizingly slow.  To boot, it's done holding a global spinlock
with interrupts disabled (see patch 6/8).  At the very, very least, each
acceptance operation involves a couple of what are effectively ring
transitions, a 2MB memset(), and a bunch of cache flushing.

The system is going to be downright unusable during this time, right?

Sure, it's *temporary* and only happens once at boot.  But, it's going
to suck.

Am I over-stating this in any way?

The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
 But, I think we should also write down some guidance like:

	If your TDX system seems as slow as a snail after boot, look at
	the "accept_memory" counter in /proc/vmstat.  If it is
	incrementing, then TDX memory acceptance is likely to blame.

Do we need anything more discrete to tell users when acceptance is over?
 For instance, maybe they run something and it goes really slow, they
watch "accept_memory" until it stops.  They rejoice at their good
fortune!  Then, memory allocation starts falling over to a new node and
the agony begins anew.

I can think of dealing with this in two ways:

	cat /sys/.../unaccepted_pages_left

which just walks the bitmap and counts the number of pages remaining, or
something like:

	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop

Which will, well, make the pain stop on node0.
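
For illustration only, the "walk the bitmap and count" idea could look
roughly like the userspace sketch below. The helper name
unaccepted_pages_left() and the plain unsigned long array are assumptions
made for the example, not something the series provides; the one fact taken
from the cover letter is that each bit covers 2MiB, i.e. 512 pages of 4k.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* One bitmap bit covers 2MiB, i.e. 512 pages of 4KiB (per the cover letter). */
#define PAGES_PER_BIT	512u
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: walk the first @nbits bits of @bitmap and return
 * how many 4KiB pages are still marked unaccepted.
 */
static uint64_t unaccepted_pages_left(const unsigned long *bitmap, size_t nbits)
{
	uint64_t set = 0;
	size_t i;

	assert(bitmap != NULL);

	for (i = 0; i < nbits; i++) {
		if (bitmap[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
			set++;
	}

	return set * PAGES_PER_BIT;
}
```

A real implementation would presumably use the kernel's bitmap helpers and
take the same lock as accept_memory(); the sketch only shows the arithmetic.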

* Re: [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
  2022-04-08 18:15   ` Dave Hansen
@ 2022-04-08 19:21   ` Dave Hansen
  2022-04-13 16:08     ` Kirill A. Shutemov
  1 sibling, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-08 19:21 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 4/5/22 16:43, Kirill A. Shutemov wrote:
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned long *unaccepted_memory;
> +	unsigned long flags;
> +	unsigned int rs, re;
> +
> +	if (!boot_params.unaccepted_memory)
> +		return;
> +
> +	unaccepted_memory = __va(boot_params.unaccepted_memory);
> +	rs = start / PMD_SIZE;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	for_each_set_bitrange_from(rs, re, unaccepted_memory,
> +				   DIV_ROUND_UP(end, PMD_SIZE)) {
> +		/* Platform-specific memory-acceptance call goes here */
> +		panic("Cannot accept memory");
> +		bitmap_clear(unaccepted_memory, rs, re - rs);
> +	}
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}

Just to reiterate: this is a global spinlock.  It's disabling
interrupts.  "Platform-specific memory-acceptance call" will soon become:

	tdx_accept_memory(rs * PMD_SIZE, re * PMD_SIZE);

which is a page-by-page __tdx_module_call():

> +	for (i = 0; i < (end - start) / PAGE_SIZE; i++) {
> +		if (__tdx_module_call(TDACCEPTPAGE, start + i * PAGE_SIZE,
> +				      0, 0, 0, NULL)) {
> +			error("Cannot accept memory: page accept failed\n");
> +		}
> +	}

Each __tdx_module_call() involves a privilege transition that also
surely includes things like changing CR3.  It can't be cheap.  It also
is presumably touching the memory and probably flushing it out of the
CPU caches.  It's also unbounded:

	spin_lock_irqsave(&unaccepted_memory_lock, flags);
	for (i = 0; i < (end - start) / PAGE_SIZE; i++)
		// thousands?  tens-of-thousands of cycles??
	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);

How far apart can end and start be?  It's at *least* 2MB in the page
allocator, which is on the order of a millisecond.  Are we sure there
aren't any callers that want to do this at a gigabyte granularity?  That
would hold the global lock and disable interrupts on the order of a second.

Do we want to bound the time that the lock can be held?  Or, should we
just let the lockup detectors tell us that we're being naughty?
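
To make the "bound the time" option concrete, here is a rough userspace
sketch, not a proposal for the actual code. It accepts a range in chunks of
at most 2MB and drops the lock between chunks. The names
accept_memory_bounded(), fake_lock(), fake_unlock() and struct accept_stats
are all made up for the example; the fake lock only counts acquisitions so
the effect on hold time can be observed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096ULL
#define PMD_SIZE	(2ULL << 20)

/* Bookkeeping so the effect of chunking can be observed. */
struct accept_stats {
	unsigned int lock_holds;	/* times the lock was taken */
	uint64_t max_pages_per_hold;	/* longest critical section, in pages */
	uint64_t pages_accepted;
};

/* Stand-ins for spin_lock_irqsave()/spin_unlock_irqrestore(). */
static void fake_lock(struct accept_stats *s)   { s->lock_holds++; }
static void fake_unlock(struct accept_stats *s) { (void)s; }

/*
 * Hypothetical variant of accept_memory(): never hold the lock for more
 * than one PMD_SIZE chunk worth of per-page accept calls.
 */
static void accept_memory_bounded(uint64_t start, uint64_t end,
				  struct accept_stats *s)
{
	assert(start <= end);
	assert(start % PAGE_SIZE == 0 && end % PAGE_SIZE == 0);

	while (start < end) {
		uint64_t chunk_end = start + PMD_SIZE;
		uint64_t pages;

		if (chunk_end > end)
			chunk_end = end;
		pages = (chunk_end - start) / PAGE_SIZE;

		fake_lock(s);
		/* The per-page platform accept call would go here. */
		s->pages_accepted += pages;
		if (pages > s->max_pages_per_hold)
			s->max_pages_per_hold = pages;
		fake_unlock(s);
		/* Interrupts and other waiters get a chance to run here. */

		start = chunk_end;
	}
}
```

With this shape a 1GB request becomes 512 short critical sections instead
of one long one, at the cost of the range no longer being accepted
atomically with respect to other callers.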

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-08 18:55   ` Dave Hansen
@ 2022-04-09 15:54     ` Kirill A. Shutemov
  2022-04-11  6:38       ` Dave Hansen
  2022-04-11  8:47       ` David Hildenbrand
  0 siblings, 2 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-09 15:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
> 
> 		^ require

Heh. That's wording from the spec.

> > guest. Accepting happens via a protocol specific for the Virtual Machine
> > platform.
> 
> 							^ s/for/to
> 
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> > 
> > Support of such memory requires a few changes in core-mm code:
> > 
> >   - memblock has to accept memory on allocation;
> > 
> >   - page allocator has to accept memory on the first allocation of the
> >     page;
> > 
> > Memblock change is trivial.
> > 
> > The page allocator is modified to accept pages on the first allocation.
> > PageUnaccepted() is used to indicate that the page requires acceptance.
> 
> Does this consume an actual page flag or is it aliased?

It is encoded as a page type in mapcount of unallocated memory. It is not
aliased with PageOffline() as I did before.

I will mention that it is a new page type.

> > Kernel only needs to accept memory once after boot, so during the boot
> > and warm up phase there will be a lot of memory acceptance. After things
> > are settled down the only price of the feature if couple of checks for
> > PageUnaccepted() in allocate and free paths. The check refers a hot
> 
> 							       ^ to
> 
> ...
> > + /*
> > +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
> > +  * be used. Page allocator has to call accept_page() before returning the page
> > +  * to the caller.
> > +  */
> 
> Let's talk about "used" with a bit more detail.  Maybe:
> 
> /*
>  * PageUnaccepted() indicates that the page has to be "accepted" before
>  * it can be read or written.  The page allocator must to call
>  * accept_page() before touching the page or returning it to the caller.
>  */

I guess s/must to call/must call/, right?

> ...
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2db95780e003..53f4aa1c92a7 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
> >   */
> >  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
> >  
> > +/*
> > + * Check if the page needs to be marked as PageUnaccepted().
> > + * Used for the new pages added to the buddy allocator for the first time.
> > + */
> > +#define FPI_UNACCEPTED		((__force fpi_t)BIT(3))
> > +
> >  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> >  static DEFINE_MUTEX(pcp_batch_high_lock);
> >  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> > @@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> >  	return page_is_buddy(higher_page, higher_buddy, order + 1);
> >  }
> >  
> > +static void accept_page(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +	int i;
> > +
> > +	accept_memory(start, start + (PAGE_SIZE << order));
> > +
> > +	for (i = 0; i < (1 << order); i++) {
> > +		if (PageUnaccepted(page + i))
> > +			__ClearPageUnaccepted(page + i);
> > +	}
> > +}
> 
> It's probably worth a comment somewhere that this can be really slow.
> 
> > +static bool page_is_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +
> > +	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
> > +}
> > +
> >  /*
> >   * Freeing function for a buddy system allocator.
> >   *
> > @@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
> >  	unsigned long combined_pfn;
> >  	struct page *buddy;
> >  	bool to_tail;
> > +	bool unaccepted = PageUnaccepted(page);
> >  
> >  	VM_BUG_ON(!zone_is_initialized(zone));
> >  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> > @@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
> >  			clear_page_guard(zone, buddy, order, migratetype);
> >  		else
> >  			del_page_from_free_list(buddy, zone, order);
> > +
> > +		/* Mark page unaccepted if any of merged pages were unaccepted */
> > +		if (PageUnaccepted(buddy))
> > +			unaccepted = true;
> 
> Naming nit: following the logic with a double-negative like !unaccepted
> is a bit hard.  Would this be more readable if it were:
> 
> 	bool page_needs_acceptance = PageUnaccepted(page);
> 
> and then the code below...
> 
> >  		combined_pfn = buddy_pfn & pfn;
> >  		page = page + (combined_pfn - pfn);
> >  		pfn = combined_pfn;
> > @@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
> >  done_merging:
> >  	set_buddy_order(page, order);
> >  
> > +	/*
> > +	 * Check if the page needs to be marked as PageUnaccepted().
> > +	 * Used for the new pages added to the buddy allocator for the first
> > +	 * time.
> > +	 */
> > +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
> > +		unaccepted = page_is_unaccepted(page, order);
> 
> 	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
> 		page_needs_acceptance = page_is_unaccepted(page, order);
> 
> > +	if (unaccepted)
> > +		__SetPageUnaccepted(page);
> 
> This is getting hard for me to follow.
> 
> There are:
> 1. Pages that come in here with PageUnaccepted()==1
> 2. Pages that come in here with PageUnaccepted()==0, but a buddy that
>    was PageUnaccepted()==1
> 
> In either of those cases, the bitmap will be consulted to see if the
> page is *truly* unaccepted or not.  But, I'm struggling to figure out
> how a page could end up in one of those scenarios and *not* be
> page_is_unaccepted().
> 
> There are three pieces of information that come in:
> 1. PageUnaccepted(page)
> 2. PageUnaccepted(buddies[])
> 3. the bitmap

1 and 2 are the same conceptually: merged-in pieces of the resulting page.

> 
> and one piece of information going out:
> 
> PageUnaccepted(page);
> 
> I think I need a more coherent description of how those four things fit
> together.

The page gets marked as PageUnaccepted() if any of the merged-in pages is
PageUnaccepted().

For new pages, which are just being added to the buddy allocator, consult
page_is_unaccepted(). FPI_UNACCEPTED indicates that the page is new and
that the page_is_unaccepted() check is required.

Avoid calling page_is_unaccepted() if it is already known that the page
needs acceptance anyway. It can be costly.

Is this a good enough explanation?
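
The same rules, written out as a standalone userspace sketch rather than
the real __free_one_page(). result_is_unaccepted() and
bitmap_says_unaccepted() are made-up names for the example, and the
counter only exists to show when the bitmap lookup gets skipped:

```c
#include <assert.h>
#include <stdbool.h>

#define FPI_UNACCEPTED	(1u << 3)	/* same bit as in the patch */

/* Counts how often the (potentially costly) bitmap lookup is made. */
static unsigned int bitmap_lookups;

/* Stand-in for page_is_unaccepted(): returns what the bitmap would say. */
static bool bitmap_says_unaccepted(bool bitmap_value)
{
	bitmap_lookups++;
	return bitmap_value;
}

/*
 * Decide whether the merged page ends up PageUnaccepted():
 *  - yes, if the page or any merged-in buddy was PageUnaccepted();
 *  - otherwise, only for new pages (FPI_UNACCEPTED), ask the bitmap.
 */
static bool result_is_unaccepted(bool page_unaccepted,
				 bool any_buddy_unaccepted,
				 unsigned int fpi_flags, bool bitmap_value)
{
	bool unaccepted = page_unaccepted || any_buddy_unaccepted;

	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
		unaccepted = bitmap_says_unaccepted(bitmap_value);

	return unaccepted;
}
```

So the bitmap is only a source of truth for pages entering the allocator
for the first time; for everything else the page type carries the state.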

FPI_UNACCEPTED is not a good name. Any help with a better one?
FPI_CHECK_UNACCEPTED?

> >  	if (fpi_flags & FPI_TO_TAIL)
> >  		to_tail = true;
> >  	else if (is_shuffle_order(order))
> > @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> >  static inline bool page_expected_state(struct page *page,
> >  					unsigned long check_flags)
> >  {
> > -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > +	    !PageUnaccepted(page))
> >  		return false;
> 
> That probably deserves a comment, and maybe its own if() statement.

A separate if() does not work: PageUnaccepted() is encoded in _mapcount.

What about this:

	/*
	 * page->_mapcount is expected to be -1.
	 *
	 * There is an exception for PageUnaccepted(). The page type can be set
	 * for pages on free list. Page types are encoded in _mapcount.
	 *
	 * PageUnaccepted() will get cleared in post_alloc_hook().
	 */
	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
		return false;

?
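
To spell out why the OR works, a small userspace model follows. It assumes
the usual page-type convention that the field starts as all ones and that
setting a type clears its bit. The value of PG_UNACCEPTED_BIT and the names
set_type() and mapcount_ok() are made up for the example; only the shape of
the check mirrors the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value, chosen only for this example. */
#define PG_UNACCEPTED_BIT	0x00000800u
/* Some other page type bit, to show that it is still rejected. */
#define PG_OTHER_TYPE_BIT	0x00000080u

/* Setting a page type clears its bit in the all-ones field. */
static uint32_t set_type(uint32_t page_type, uint32_t flag)
{
	return page_type & ~flag;
}

/*
 * Model of the proposed check: the field must read as -1 (all ones)
 * once the unaccepted bit is ORed back in.
 */
static bool mapcount_ok(uint32_t mapcount_bits)
{
	return (mapcount_bits | PG_UNACCEPTED_BIT) == UINT32_MAX;
}
```

A plain -1 passes, -1 with only the unaccepted type set passes, and any
other deviation (a real mapcount or another page type) still fails.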

> >  	if (unlikely((unsigned long)page->mapping |
> > @@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
> >  	 * Bypass PCP and place fresh pages right to the tail, primarily
> >  	 * relevant for memory onlining.
> >  	 */
> > -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> > +	__free_pages_ok(page, order,
> > +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
> >  }
> >  
> >  #ifdef CONFIG_NUMA
> > @@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
> >  		return;
> >  	}
> >  
> > +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
> >  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> >  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
> >  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> 
> Comment, please.  I assume doing the slow accept up front is OK here
> because this is in the deferred path.  But, it would be nice to know for
> sure.

It accepts ranges smaller than a pageblock up front. I will add a comment.

> 
> > @@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
> >  		if (set_page_guard(zone, &page[size], high, migratetype))
> >  			continue;
> >  
> > +		/* Transfer PageUnaccepted() to the newly split pages */
> > +		if (PageUnaccepted(page))
> > +			__SetPageUnaccepted(&page[size]);
> 
> We don't want to just accept the page here, right?  Because we're
> holding the zone lock?  Maybe we should mention that:
> 
> 		/*
> 		 * Transfer PageUnaccepted() to the newly split pages so
> 		 * they can be accepted after dropping the zone lock.
> 		 */

Okay.

> >  		add_to_free_list(&page[size], zone, high, migratetype);
> >  		set_buddy_order(&page[size], high);
> >  	}
> > @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  	 */
> >  	kernel_unpoison_pages(page, 1 << order);
> >  
> > +	if (PageUnaccepted(page))
> > +		accept_page(page, order);
> > +
> >  	/*
> >  	 * As memory initialization might be integrated into KASAN,
> >  	 * KASAN unpoisoning and memory initializion code must be
> 
> Is accepted memory guaranteed to be zeroed?  Do we want to skip the
> __GFP_ZERO behavior later in this function?  Or is that just a silly
> over-optimization?

For TDX, it is true that the memory gets cleared on acceptance, but I
don't think we can say the same for every possible implementation.

I would rather leave __GFP_ZERO for peace of mind. Clearing the cache-hot
page for the second time shouldn't be a big deal compared to the
acceptance cost.

-- 
 Kirill A. Shutemov

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-08 19:11   ` Dave Hansen
@ 2022-04-09 17:52     ` Kirill A. Shutemov
  2022-04-11  6:41       ` Dave Hansen
  2022-04-12  8:15     ` David Hildenbrand
  1 sibling, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-09 17:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Fri, Apr 08, 2022 at 12:11:58PM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > Kernel only needs to accept memory once after boot, so during the boot
> > and warm up phase there will be a lot of memory acceptance. After things
> > are settled down the only price of the feature if couple of checks for
> > PageUnaccepted() in allocate and free paths. The check refers a hot
> > variable (that also encodes PageBuddy()), so it is cheap and not visible
> > on profiles.
> 
> Let's also not sugar-coat this.  Page acceptance is hideously slow.
> It's agonizingly slow.  To boot, it's done holding a global spinlock
> with interrupts disabled (see patch 6/8).  At the very, very least, each
> acceptance operation involves a couple of what are effectively ring
> transitions, a 2MB memset(), and a bunch of cache flushing.
> 
> The system is going to be downright unusable during this time, right?

Well, yes. The CPU that is doing the accepting is completely blocked by it.
But other CPUs may proceed until they, in turn, step onto memory that
needs accepting.

> Sure, it's *temporary* and only happens once at boot.  But, it's going
> to suck.
> 
> Am I over-stating this in any way?
> 
> The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
>  But, I think we should also write down some guidance like:
> 
> 	If your TDX system seems as slow as a snail after boot, look at
> 	the "accept_memory" counter in /proc/vmstat.  If it is
> 	incrementing, then TDX memory acceptance is likely to blame.

Sure. Will add to commit message.

> Do we need anything more discrete to tell users when acceptance is over?

I can imagine setups where acceptance is never over. A VM running
a workload with a fixed dataset can have plenty of memory left unaccepted.

I don't think "make it over" should be the goal.

>  For instance, maybe they run something and it goes really slow, they
> watch "accept_memory" until it stops.  They rejoice at their good
> fortune!  Then, memory allocation starts falling over to a new node and
> the agony begins anew.
> 
> I can think of dealing with this in two ways:
> 
> 	cat /sys/.../unaccepted_pages_left
> 
> which just walks the bitmap and counts the number of pages remaining, or
> something like:
> 
> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
> 
> Which will, well, make the pain stop on node0.

Sure we can add handles. But API is hard. Maybe we should wait and see
what is actually needed. (Yes, I'm lazy.:)

-- 
 Kirill A. Shutemov

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-08 17:26   ` Dave Hansen
@ 2022-04-09 19:41     ` Kirill A. Shutemov
  2022-04-14 15:55     ` Borislav Petkov
  1 sibling, 0 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-09 19:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Fri, Apr 08, 2022 at 10:26:14AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > +void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> > +{
> > +	/*
> > +	 * The accepted memory bitmap only works at PMD_SIZE granularity.
> > +	 * If a request comes in to mark memory as unaccepted which is not
> > +	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
> > +	 * *marked* as unaccepted.
> > +	 */
> > +
> > +	/*
> > +	 * Accept small regions that might not be able to be represented
> > +	 * in the bitmap:
> > +	 */
> > +	if (end - start < 2 * PMD_SIZE) {
> > +		__accept_memory(start, end);
> > +		return;
> > +	}
> 
> This is not my first time looking at this code and I still had to think
> about this a bit.  That's not good.  That pathological case here is
> actually something like this:
> 
> | 4k | 2044k + 2044k | 4k |
> ^ 0x0 	     ^ 2MB	  ^ 4MB
> 
> Where we have a 2MB-aligned 4k accepted area, a 4088k unaccepted area,
> then another 4k accepted area.  That will not result in any bits being
> set in the accepted memory bitmap because no 2MB region is fully accepted.
> 
> The one oddball case is this:
> 
> | 4k | 2044k |    2048k   |
> ^ 0x0 	     ^ 2MB	  ^ 4MB
> 
> Which would fall into the if() above, but *can* have part of its range
> marked in the bitmap.
> 
> Maybe we need something more like this:
> 
> 	/*
> 	 * Accept small regions that might not be able to be represented
> 	 * in the bitmap.  This is a bit imprecise and may accept some
> 	 * areas that could have been represented in the bitmap instead.
> 	 * But, the imprecision makes the code simpler by ensuring that
> 	 * at least one bit will be set in the bitmap below.
> 	 */

Okay, will change.
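
To make the imprecision concrete, here is a small user-space sketch of
the two cases above. It assumes the rest of mark_unaccepted() accepts the
unaligned head and tail and marks only the 2MiB-aligned middle; the
function names and SKETCH_PMD_SIZE are made up for illustration.

```c
#include <stdint.h>

#define SKETCH_PMD_SIZE (2ULL << 20)	/* 2MiB, the bitmap granularity */

/*
 * Number of bitmap bits the simple scheme sets for [start, end):
 * ranges shorter than 2 * PMD_SIZE are accepted outright (0 bits),
 * otherwise the unaligned head and tail are accepted and only the
 * 2MiB-aligned middle is left for the bitmap.
 */
uint64_t bits_marked_simple(uint64_t start, uint64_t end)
{
	if (end - start < 2 * SKETCH_PMD_SIZE)
		return 0;

	start = (start + SKETCH_PMD_SIZE - 1) & ~(SKETCH_PMD_SIZE - 1);
	end &= ~(SKETCH_PMD_SIZE - 1);
	return (end - start) / SKETCH_PMD_SIZE;
}

/* Number of bits a fully precise scheme could set for the same range. */
uint64_t bits_marked_precise(uint64_t start, uint64_t end)
{
	start = (start + SKETCH_PMD_SIZE - 1) & ~(SKETCH_PMD_SIZE - 1);
	end &= ~(SKETCH_PMD_SIZE - 1);
	return end > start ? (end - start) / SKETCH_PMD_SIZE : 0;
}
```

For the range 4k..4MB the simple variant marks nothing while the precise
one could mark a single bit; for 4k..(4MB - 4k) neither can mark
anything. Any range of at least 2 * PMD_SIZE always contains one aligned
2MiB chunk, so the simple check guarantees a non-empty bitmap_set().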

> > diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> > index 2c3dac5ecb36..b17ceec757d0 100644
> > --- a/drivers/firmware/efi/Kconfig
> > +++ b/drivers/firmware/efi/Kconfig
> > @@ -243,6 +243,21 @@ config EFI_DISABLE_PCI_DMA
> >  	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
> >  	  may be used to override this option.
> >  
> > +config UNACCEPTED_MEMORY
> > +	bool
> > +	depends on EFI_STUB
> > +	depends on !KEXEC_CORE
> 
> The changelog should probably say something about how the kexec()
> incompatibility is going to be rectified in the future.

Okay.

> > +	help
> > +	   Some Virtual Machine platforms, such as Intel TDX, require
> > +	   some memory to be "accepted" by the guest before it can be used.
> > +	   This mechanism helps prevent malicious hosts from making changes
> > +	   to guest memory.
> > +
> > +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> > +
> > +	   This option adds support for unaccepted memory and makes such memory
> > +	   usable by kernel.
> > +
> >  endmenu
> 
> BTW, what happens if this is compiled out?  Do TDX guests just lose all
> the unaccepted memory?

No. It will not have access to unaccepted memory and will only use memory
accepted by BIOS.

> Should TDX be selecting this or something?

Yes, it should and we do this.

> > @@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
> >  			e820_type = E820_TYPE_PMEM;
> >  			break;
> >  
> > +		case EFI_UNACCEPTED_MEMORY:
> > +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +				continue;
> 
> This seems worthy of a pr_info().  We're effectively throwing away
> memory with this "continue", right?

Yes. In this case we treat unaccepted memory as reserved and inaccessible
to the kernel.

Maybe pr_warn() is more appropriate.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 4/8] x86/boot/compressed: Handle unaccepted memory
  2022-04-08 17:57   ` Dave Hansen
@ 2022-04-09 20:20     ` Kirill A. Shutemov
  2022-04-11  6:49       ` Dave Hansen
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-09 20:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Fri, Apr 08, 2022 at 10:57:17AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > Firmware is responsible for accepting memory where compressed kernel
> > image and initrd land. But kernel has to accept memory for decompression
> > buffer: accept memory just before decompression starts.
> 
> I think I'd appreciate a sentence or two more about what's going on.
> How about something like this?
> 
> The firmware starts the kernel by booting into the "compressed" kernel
> stub.  That stub's job is to decompress the full kernel image and then
> jump to it.
> 
> The firmware will pre-accept the memory used to run the stub.  But, the
> stub is responsible for accepting the memory into which it decompresses
> the main kernel.  Accept memory just before decompression starts.
> 
> The stub is also responsible for choosing a physical address in which to
> place the decompressed kernel image.  The KASLR mechanism will randomize
> this physical address.  Since the pre-accepted memory region is relatively
> small, KASLR would be quite ineffective if it only used the pre-accepted
> area (EFI_CONVENTIONAL_MEMORY).  Ensure that KASLR randomizes among the
> entire physical address space by also including EFI_UNACCEPTED_MEMORY.

Sure, looks good.

> > diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
> > index bf58b259380a..ba2de61c0823 100644
> > --- a/arch/x86/boot/compressed/bitmap.c
> > +++ b/arch/x86/boot/compressed/bitmap.c
> > @@ -2,6 +2,48 @@
> >  /* Taken from lib/string.c */
> >  
> >  #include <linux/bitmap.h>
> > +#include <linux/math.h>
> > +#include <linux/minmax.h>
> > +
> > +unsigned long _find_next_bit(const unsigned long *addr1,
> > +		const unsigned long *addr2, unsigned long nbits,
> > +		unsigned long start, unsigned long invert, unsigned long le)
> > +{
> > +	unsigned long tmp, mask;
> > +
> > +	if (unlikely(start >= nbits))
> > +		return nbits;
> > +
> > +	tmp = addr1[start / BITS_PER_LONG];
> > +	if (addr2)
> > +		tmp &= addr2[start / BITS_PER_LONG];
> > +	tmp ^= invert;
> > +
> > +	/* Handle 1st word. */
> > +	mask = BITMAP_FIRST_WORD_MASK(start);
> > +	if (le)
> > +		mask = swab(mask);
> > +
> > +	tmp &= mask;
> > +
> > +	start = round_down(start, BITS_PER_LONG);
> > +
> > +	while (!tmp) {
> > +		start += BITS_PER_LONG;
> > +		if (start >= nbits)
> > +			return nbits;
> > +
> > +		tmp = addr1[start / BITS_PER_LONG];
> > +		if (addr2)
> > +			tmp &= addr2[start / BITS_PER_LONG];
> > +		tmp ^= invert;
> > +	}
> > +
> > +	if (le)
> > +		tmp = swab(tmp);
> > +
> > +	return min(start + __ffs(tmp), nbits);
> > +}
> >  
> >  void __bitmap_set(unsigned long *map, unsigned int start, int len)
> >  {
> > @@ -22,3 +64,23 @@ void __bitmap_set(unsigned long *map, unsigned int start, int len)
> >  		*p |= mask_to_set;
> >  	}
> >  }
> > +
> > +void __bitmap_clear(unsigned long *map, unsigned int start, int len)
> > +{
> > +	unsigned long *p = map + BIT_WORD(start);
> > +	const unsigned int size = start + len;
> > +	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
> > +	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
> > +
> > +	while (len - bits_to_clear >= 0) {
> > +		*p &= ~mask_to_clear;
> > +		len -= bits_to_clear;
> > +		bits_to_clear = BITS_PER_LONG;
> > +		mask_to_clear = ~0UL;
> > +		p++;
> > +	}
> > +	if (len) {
> > +		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
> > +		*p &= ~mask_to_clear;
> > +	}
> > +}
> 
> It's a real shame that we have to duplicate this code.  Is there
> anything crazy we could do here like
> 
> #include "../../../lib/find_bit.c"
> 
> ?

Well, it would require fracturing source files on the kernel side.

__bitmap_set() and __bitmap_clear() are now in lib/bitmap.c.

_find_next_bit() is in lib/find_bit.c.

Both lib/bitmap.c and lib/find_bit.c contain a lot of code that is not used
here. I guess we would need to split them into a few pieces to do it in a
sane way. Do you want me to go down this path?

> 
> > diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> > index 411b268bc0a2..59db90626042 100644
> > --- a/arch/x86/boot/compressed/kaslr.c
> > +++ b/arch/x86/boot/compressed/kaslr.c
> > @@ -725,10 +725,20 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
> >  		 * but in practice there's firmware where using that memory leads
> >  		 * to crashes.
> >  		 *
> > -		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
> > +		 * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if
> > +		 * supported) are guaranteed to be free.
> >  		 */
> > -		if (md->type != EFI_CONVENTIONAL_MEMORY)
> > +
> > +		switch (md->type) {
> > +		case EFI_CONVENTIONAL_MEMORY:
> > +			break;
> > +		case EFI_UNACCEPTED_MEMORY:
> > +			if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +				break;
> >  			continue;
> > +		default:
> > +			continue;
> > +		}
> >  
> >  		if (efi_soft_reserve_enabled() &&
> >  		    (md->attribute & EFI_MEMORY_SP))
> > diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> > index fa8969fad011..c1d9d71a6615 100644
> > --- a/arch/x86/boot/compressed/misc.c
> > +++ b/arch/x86/boot/compressed/misc.c
> > @@ -18,6 +18,7 @@
> >  #include "../string.h"
> >  #include "../voffset.h"
> >  #include <asm/bootparam_utils.h>
> > +#include <asm/unaccepted_memory.h>
> >  
> >  /*
> >   * WARNING!!
> > @@ -43,6 +44,9 @@
> >  void *memmove(void *dest, const void *src, size_t n);
> >  #endif
> >  
> > +#undef __pa
> > +#define __pa(x)	((unsigned long)(x))
> 
> Those #undef's always worry me.  Why is this one needed?

arch/x86/boot/compressed/misc.c:47:9: warning: '__pa' macro redefined [-Wmacro-redefined]
#define __pa(x) ((unsigned long)(x))
        ^
arch/x86/include/asm/page.h:47:9: note: previous definition is here
#define __pa(x)         __phys_addr((unsigned long)(x))

Note that sev.c does the same. At least we are consistent :)

> >  /*
> >   * This is set up by the setup-routine at boot-time
> >   */
> > @@ -451,6 +455,13 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
> >  #endif
> >  
> >  	debug_putstr("\nDecompressing Linux... ");
> > +
> > +	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
> > +	    boot_params->unaccepted_memory) {
> > +		debug_putstr("Accepting memory... ");
> > +		accept_memory(__pa(output), __pa(output) + needed_size);
> > +	}
> > +
> >  	__decompress(input_data, input_len, NULL, NULL, output, output_len,
> >  			NULL, error);
> >  	parse_elf(output);
> > diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
> > index d363acf59c08..3ebab63789bb 100644
> > --- a/arch/x86/boot/compressed/unaccepted_memory.c
> > +++ b/arch/x86/boot/compressed/unaccepted_memory.c
> > @@ -51,3 +51,17 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> >  	bitmap_set((unsigned long *)params->unaccepted_memory,
> >  		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> >  }
> > +
> > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +	unsigned long *unaccepted_memory;
> > +	unsigned int rs, re;
> > +
> > +	unaccepted_memory = (unsigned long *)boot_params->unaccepted_memory;
> > +	rs = start / PMD_SIZE;
> 
> OK, so start is a physical address, PMD_SIZE is 2^21, and 'rs' is an
> unsigned int.  That means 'rs' can, at most, represent a physical
> address at 2^(21+32), or 2^53.  That's cutting it a *bit* close, don't
> you think?
> 
> Could we please just give 'rs' and 're' real names and make them
> 'unsigned long's, please?  It will surely save at least one other person
> from doing math.  The find_next_bit() functions seem to take ulongs anyway.

Okay. 'range_start' and 'range_end' are good enough names?
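
The truncation is easy to demonstrate in user space. The sketch below
uses fixed-width types so that it behaves the same on any host; the
helper names are invented for illustration.

```c
#include <stdint.h>

#define SKETCH_PMD_SHIFT 21	/* PMD_SIZE is 2^21 */

/* Bit index as the reviewed code computes it: stored in 32 bits. */
uint32_t bit_index_u32(uint64_t phys_addr)
{
	return (uint32_t)(phys_addr >> SKETCH_PMD_SHIFT);
}

/* Bit index kept at full width. */
uint64_t bit_index_u64(uint64_t phys_addr)
{
	return phys_addr >> SKETCH_PMD_SHIFT;
}
```

With a physical address of 2^53 the 32-bit index wraps to 0, while the
full-width index is 2^32.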

> 
> > +	for_each_set_bitrange_from(rs, re, unaccepted_memory,
> > +				   DIV_ROUND_UP(end, PMD_SIZE)) {
> > +		__accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
> > +		bitmap_clear(unaccepted_memory, rs, re - rs);
> > +	}
> > +}
> 
> Could we please introduce some intermediate variable?  For instance:
> 
> 	unsigned long bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
> 
> That will make this all a lot easier to read.

Okay.
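
Something along these lines, shown as a user-space sketch rather than
the actual follow-up patch. The sketch_* helpers stand in for
__accept_memory() and the bitmap API, and the byte counter exists only
so the behaviour can be checked; all of these names are assumptions.

```c
#include <limits.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE (2ULL << 20)
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Stand-in for the platform-specific accept call; it only counts bytes. */
static uint64_t sketch_accepted_bytes;

static void sketch_accept_range(uint64_t start, uint64_t end)
{
	sketch_accepted_bytes += end - start;
}

static int sketch_test_bit(const unsigned long *map, unsigned long bit)
{
	return (map[bit / SKETCH_BITS_PER_LONG] >>
		(bit % SKETCH_BITS_PER_LONG)) & 1;
}

static void sketch_clear_bit(unsigned long *map, unsigned long bit)
{
	map[bit / SKETCH_BITS_PER_LONG] &=
		~(1UL << (bit % SKETCH_BITS_PER_LONG));
}

/*
 * Accept every still-unaccepted 2MiB chunk that overlaps [start, end)
 * and clear its bit so that it is never accepted twice.
 * Returns the number of bytes accepted by this call.
 */
uint64_t sketch_accept_memory(unsigned long *bitmap, uint64_t start,
			      uint64_t end)
{
	unsigned long range_start = start / SKETCH_PMD_SIZE;
	unsigned long bitmap_size = (end + SKETCH_PMD_SIZE - 1) / SKETCH_PMD_SIZE;
	unsigned long range_end;
	uint64_t before = sketch_accepted_bytes;

	while (range_start < bitmap_size) {
		if (!sketch_test_bit(bitmap, range_start)) {
			range_start++;
			continue;
		}
		/* Extend the run of set bits as far as it goes. */
		range_end = range_start;
		while (range_end < bitmap_size &&
		       sketch_test_bit(bitmap, range_end)) {
			sketch_clear_bit(bitmap, range_end);
			range_end++;
		}
		sketch_accept_range(range_start * SKETCH_PMD_SIZE,
				    range_end * SKETCH_PMD_SIZE);
		range_start = range_end;
	}
	return sketch_accepted_bytes - before;
}
```

Calling it twice on the same range accepts nothing the second time,
since the bits were cleared by the first call.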

> 
> > diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> > index cbc24040b853..f1f835d3cd78 100644
> > --- a/arch/x86/include/asm/unaccepted_memory.h
> > +++ b/arch/x86/include/asm/unaccepted_memory.h
> > @@ -9,4 +9,6 @@ struct boot_params;
> >  
> >  void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
> >  
> > +void accept_memory(phys_addr_t start, phys_addr_t end);
> > +
> >  #endif
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 5/8] x86/mm: Reserve unaccepted memory bitmap
  2022-04-08 18:08   ` Dave Hansen
@ 2022-04-09 20:43     ` Kirill A. Shutemov
  0 siblings, 0 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-09 20:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Fri, Apr 08, 2022 at 11:08:48AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > A given page of memory can only be accepted once.  The kernel has a need
> > to accept memory both in the early decompression stage and during normal
> > runtime.
> > 
> > A bitmap is used to communicate the acceptance state of each page between the
> > decompression stage and normal runtime.  This eliminates the possibility of
> > attempting to double-accept a page.
> 
> ... which is fatal, right?  Could you include that and also the rationale
> for why it is fatal?

Actually, no. For TDX, TDX_ACCEPT_PAGE would just return an error. It is
not fatal, but it indicates that there is a security hole. The VMM can clear
a range of memory if it can trick the guest into re-accepting the memory
blindly:

1. VMM removes the memory range from the guest. VMM can do it at any
   point.
2. VMM re-adds memory for the same range of the guest physical addresses.
3. VMM tricks guest into re-accepting this memory blindly. It makes the
   range of guest physical memory filled with 0.
4. ???
5. PROFIT!

TDX relies on the guest to be careful with accepting memory and to only do
this for memory that is not in use.

This patchset is designed to keep unaccepted memory accounted for and to
prevent such double acceptance.


> > The bitmap is allocated in EFI stub, decompression stage updates the state
> > of pages used for the kernel and initrd and hands the bitmap over to the
> > main kernel image via boot_params.
> 
> This is really good info.  Could we maybe expand it a bit?
> 
> There are several steps in the bitmap's lifecycle:
> 1. Bitmap is allocated in the EFI stub

Allocated and populated.

> 2. The kernel decompression code locates it, accepts some pages before
>    decompression and marks them in the bitmap.
> 3. Decompression code points to the bitmap via a boot_param

Actually EFI stub makes boot_param point to the bitmap. Decompression code
and main kernel consume it from there.

> 4. Main kernel locates bitmap via the boot_param and uses it to guide
>    runtime acceptance decisions.
> 
> > In the runtime kernel, reserve the bitmap's memory to ensure nothing
> > overwrites it.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/x86/kernel/e820.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index f267205f2d5a..22d1fe48dcba 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
> >  	int i;
> >  	u64 end;
> >  
> > +	/* Mark unaccepted memory bitmap reserved */
> > +	if (boot_params.unaccepted_memory) {
> > +		unsigned long size;
> > +
> > +		/* One bit per 2MB */
> > +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > +				    PMD_SIZE * BITS_PER_BYTE);
> > +		memblock_reserve(boot_params.unaccepted_memory, size);
> > +	}
> 
> One oddity here: The size is implied by the e820's contents.  Did you
> mention somewhere that unaccepted memory is considered E820_TYPE_RAM?
> It *has* to be in order for e820__end_of_ram_pfn() to work, right?

"e820_type = E820_TYPE_RAM;" for "case EFI_UNACCEPTED_MEMORY:" in patch
3/8 does this.

I didn't write it down explicitly. I will.
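
As a sanity check of the size calculation, here is the same arithmetic
as a stand-alone sketch. unaccepted_bitmap_bytes() and the SKETCH_*
macros are illustrative names; end_of_ram stands for
e820__end_of_ram_pfn() * PAGE_SIZE.

```c
#include <stdint.h>

#define SKETCH_PMD_SIZE      (2ULL << 20)	/* one bit per 2MiB */
#define SKETCH_BITS_PER_BYTE 8ULL

/* Bytes of bitmap needed to cover physical addresses [0, end_of_ram). */
uint64_t unaccepted_bitmap_bytes(uint64_t end_of_ram)
{
	/* Each bitmap byte covers 8 * 2MiB = 16MiB of address space. */
	uint64_t covered_per_byte = SKETCH_PMD_SIZE * SKETCH_BITS_PER_BYTE;

	return (end_of_ram + covered_per_byte - 1) / covered_per_byte;
}
```

For 64GiB of address space this gives 4096 bytes, matching the "one 4k
page per 64GiB" figure from the cover letter.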

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory
  2022-04-08 17:02 ` [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Dave Hansen
@ 2022-04-09 23:44   ` Kirill A. Shutemov
  2022-04-21 12:29     ` Borislav Petkov
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-09 23:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Fri, Apr 08, 2022 at 10:02:13AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > Patches 1-6/7 are generic and don't have any dependencies on TDX. They
> > should serve AMD SEV needs as well. TDX-specific code isolated in the
> > last patch.
> 
> Oh, that's quite nice.  Are the SEV-SNP folks planning on using this?
> If they are, acks/reviews would be much appreciated.

AMD folks tested one of the previous revisions and reported that it works,
but I don't remember seeing the code that hooks up the AMD implementation.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-09 15:54     ` Kirill A. Shutemov
@ 2022-04-11  6:38       ` Dave Hansen
  2022-04-11 10:07         ` Mike Rapoport
  2022-04-11  8:47       ` David Hildenbrand
  1 sibling, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-11  6:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 4/9/22 08:54, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
>>> The page allocator is modified to accept pages on the first allocation.
>>> PageUnaccepted() is used to indicate that the page requires acceptance.
>>
>> Does this consume an actual page flag or is it aliased?
> 
> It is encoded as a page type in mapcount of unallocated memory. It is not
> aliased with PageOffline() as I did before.
> 
> I will mention that it is a new page type.

Guess I should have looked at the code. :)

Are we just increasingly using the StudlyNames() for anything to do with
pages?

>>> + /*
>>> +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
>>> +  * be used. Page allocator has to call accept_page() before returning the page
>>> +  * to the caller.
>>> +  */
>>
>> Let's talk about "used" with a bit more detail.  Maybe:
>>
>> /*
>>  * PageUnaccepted() indicates that the page has to be "accepted" before
>>  * it can be read or written.  The page allocator must to call
>>  * accept_page() before touching the page or returning it to the caller.
>>  */
> 
> I guess s/must to call/must call/, right?

Yep.

...
>>> +	/*
>>> +	 * Check if the page needs to be marked as PageUnaccepted().
>>> +	 * Used for the new pages added to the buddy allocator for the first
>>> +	 * time.
>>> +	 */
>>> +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
>>> +		unaccepted = page_is_unaccepted(page, order);
>>
>> 	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
>> 		page_needs_acceptance = page_is_unaccepted(page, order);
>>
>>> +	if (unaccepted)
>>> +		__SetPageUnaccepted(page);
>>
>> This is getting hard for me to follow.
>>
>> There are:
>> 1. Pages that come in here with PageUnaccepted()==1
>> 2. Pages that come in here with PageUnaccepted()==0, but a buddy that
>>    was PageUnaccepted()==1
>>
>> In either of those cases, the bitmap will be consulted to see if the
>> page is *truly* unaccepted or not.  But, I'm struggling to figure out
>> how a page could end up in one of those scenarios and *not* be
>> page_is_unaccepted().
>>
>> There are three pieces of information that come in:
>> 1. PageUnaccepted(page)
>> 2. PageUnaccepted(buddies[])
>> 3. the bitmap
> 
> 1 and 2 are the same conceptually: merged-in pieces of the resulting page.
> 
>>
>> and one piece of information going out:
>>
>> PageUnaccepted(page);
>>
>> I think I need a more coherent description of how those four things fit
>> together.
> 
> The page gets marked as PageUnaccepted() if any of merged-in pages is
> PageUnaccepted().
> 
> For new pages, just being added to buddy allocator, consult
> page_is_unaccepted(). FPI_UNACCEPTED indicates that the page is new and
> page_is_unaccepted() check is required.
> 
> Avoid calling page_is_unaccepted() if it is known that the page needs
> acceptance anyway. It can be costly.
> 
> Is that a good enough explanation?

Yeah, elaborating on the slow and fast paths makes a lot of sense.

> FPI_UNACCEPTED is not a good name. Any help with a better one?
> FPI_CHECK_UNACCEPTED?

Maybe even something like FPI_UNACCEPTED_SLOWPATH.  Then you can say
that when this is passed in the pages might not have PageUnaccepted()
set and the slow bitmap needs to be consulted.
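
Roughly, the decision could be structured like the sketch below. This is
a user-space illustration only: the flag value, the lookup counter and
the stand-in for page_is_unaccepted() are all assumptions made to show
the fast path versus the slow path.

```c
#include <stdbool.h>

#define SKETCH_FPI_UNACCEPTED_SLOWPATH 0x1u	/* hypothetical flag value */

/* Counts how often the (expensive) bitmap lookup was needed. */
unsigned int sketch_bitmap_lookups;

/* Stand-in for page_is_unaccepted(): pretend the bitmap says "unaccepted". */
bool sketch_bitmap_says_unaccepted(void)
{
	sketch_bitmap_lookups++;
	return true;
}

/*
 * Decide whether the freed (possibly merged) page must be marked
 * unaccepted. merged_unaccepted is true if the page or any buddy merged
 * into it was already marked.
 */
bool sketch_needs_acceptance(bool merged_unaccepted, unsigned int fpi_flags)
{
	/* Fast path: already known, skip the bitmap. */
	if (merged_unaccepted)
		return true;

	/* Slow path: a new page entering the allocator for the first time. */
	if (fpi_flags & SKETCH_FPI_UNACCEPTED_SLOWPATH)
		return sketch_bitmap_says_unaccepted();

	return false;
}
```

The bitmap is only consulted when nothing is known yet and the caller
explicitly asked for the check.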

>>>  	if (fpi_flags & FPI_TO_TAIL)
>>>  		to_tail = true;
>>>  	else if (is_shuffle_order(order))
>>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>>>  static inline bool page_expected_state(struct page *page,
>>>  					unsigned long check_flags)
>>>  {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageUnaccepted(page))
>>>  		return false;
>>
>> That probably deserves a comment, and maybe its own if() statement.
> 
> Own if does not work. PageUnaccepted() is encoded in _mapcount.
> 
> What about this:
> 
> 	/*
> 	 * page->_mapcount is expected to be -1.
> 	 *
> 	 * There is an exception for PageUnaccepted(). The page type can be set
> 	 * for pages on free list. Page types are encoded in _mapcount.
> 	 *
> 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> 	 */
> 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> 		return false;
> 
> ?

That's better.  But, aren't the PG_* names usually reserved for real
page->flags bits?  That naming might be part of my confusion.
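
For anyone else puzzling over why the OR works, a small user-space
sketch follows. It assumes that a page type is set by clearing its bit
in the all-ones (-1) pattern; the bit value chosen here is hypothetical
and the helper names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAPCOUNT_FREE UINT32_MAX		/* bit pattern of -1 */
#define SKETCH_PG_UNACCEPTED 0x00000800u	/* hypothetical type bit */

/* Setting the page type clears its bit in the stored value. */
uint32_t sketch_set_unaccepted(uint32_t mapcount_bits)
{
	return mapcount_bits & ~SKETCH_PG_UNACCEPTED;
}

/* True if the value is -1, or -1 with only the unaccepted bit cleared. */
bool sketch_mapcount_expected(uint32_t mapcount_bits)
{
	return (mapcount_bits | SKETCH_PG_UNACCEPTED) == SKETCH_MAPCOUNT_FREE;
}
```

ORing the type bit back in maps both acceptable values to -1, so a
single comparison covers them; any other cleared bit still fails.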

>>>  		add_to_free_list(&page[size], zone, high, migratetype);
>>>  		set_buddy_order(&page[size], high);
>>>  	}
>>> @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>>>  	 */
>>>  	kernel_unpoison_pages(page, 1 << order);
>>>  
>>> +	if (PageUnaccepted(page))
>>> +		accept_page(page, order);
>>> +
>>>  	/*
>>>  	 * As memory initialization might be integrated into KASAN,
>>>  	 * KASAN unpoisoning and memory initializion code must be
>>
>> Is accepted memory guaranteed to be zeroed?  Do we want to skip the
>> __GFP_ZERO behavior later in this function?  Or is that just a silly
>> over-optimization?
> 
> For TDX, it is true that the memory gets cleared on acceptance, but I
> don't think we can say the same for any possible implementation.
> 
> I would rather leave __GFP_ZERO for peace of mind. Clearing the cache-hot
> page for the second time shouldn't be a big deal compared to the
> acceptance cost.

Sure, fair enough.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-09 17:52     ` Kirill A. Shutemov
@ 2022-04-11  6:41       ` Dave Hansen
  2022-04-11 15:55         ` Borislav Petkov
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-11  6:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 4/9/22 10:52, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 12:11:58PM -0700, Dave Hansen wrote:
>> On 4/5/22 16:43, Kirill A. Shutemov wrote:
>>> Kernel only needs to accept memory once after boot, so during the boot
>>> and warm up phase there will be a lot of memory acceptance. After things
>>> are settled down the only price of the feature is a couple of checks for
>>> PageUnaccepted() in allocate and free paths. The check refers a hot
>>> variable (that also encodes PageBuddy()), so it is cheap and not visible
>>> on profiles.
>>
>> Let's also not sugar-coat this.  Page acceptance is hideously slow.
>> It's agonizingly slow.  To boot, it's done holding a global spinlock
>> with interrupts disabled (see patch 6/8).  At the very, very least, each
>> acceptance operation involves a couple of what are effectively ring
>> transitions, a 2MB memset(), and a bunch of cache flushing.
>>
>> The system is going to be downright unusable during this time, right?
...
>> Do we need anything more discrete to tell users when acceptance is over?
> 
> I can imagine setups where acceptance is never over. A VM running
> a workload with a fixed dataset can have plenty of memory unaccepted.
> 
> I don't think "make it over" should be the goal.

I agree, there will be users that don't care when acceptance is over.
But, I'm also sure that there are users that will care deeply.

>>  For instance, maybe they run something and it goes really slow, they
>> watch "accept_memory" until it stops.  They rejoice at their good
>> fortune!  Then, memory allocation starts falling over to a new node and
>> the agony begins anew.
>>
>> I can think of dealing with this in two ways:
>>
>> 	cat /sys/.../unaccepted_pages_left
>>
>> which just walks the bitmap and counts the amount of pages remaining. or
>> something like:
>>
>> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
>>
>> Which will, well, make the pain stop on node0.
> 
> Sure we can add handles. But API is hard. Maybe we should wait and see
> what is actually needed. (Yes, I'm lazy.:)

Let's just call out the possible (probable?) need for new ABI here.
Maybe it will cue folks who care to speak up.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 4/8] x86/boot/compressed: Handle unaccepted memory
  2022-04-09 20:20     ` Kirill A. Shutemov
@ 2022-04-11  6:49       ` Dave Hansen
  0 siblings, 0 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-11  6:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 4/9/22 13:20, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 10:57:17AM -0700, Dave Hansen wrote:
...
>> It's a real shame that we have to duplicate this code.  Is there
>> anything crazy we could do here like
>>
>> #include "../../../lib/find_bit.c"
>>
>> ?
> 
> Well, it would require fracturing source files on the kernel side.
> 
> __bitmap_set() and __bitmap_clear() are now in lib/bitmap.c.
> 
> _find_next_bit() is in lib/find_bit.c.
> 
> Both lib/bitmap.c and lib/find_bit.c contain a lot of code that is not used
> here. I guess we would need to split them into a few pieces to do it in a
> sane way. Do you want me to go down this path?

I'd be curious if others have any sane ideas for how to do it.

One idea would be to stick most of the implementation in a header that
we can #include.  Then, lib/find_bit.c #includes that header and does
something simple like:

#include "header.h"
int _find_next_bit(...)
{
	return _find_next_bit_from_header();
}
EXPORT_SYMBOL(_find_next_bit);


>>> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
>>> index fa8969fad011..c1d9d71a6615 100644
>>> --- a/arch/x86/boot/compressed/misc.c
>>> +++ b/arch/x86/boot/compressed/misc.c
>>> @@ -18,6 +18,7 @@
>>>  #include "../string.h"
>>>  #include "../voffset.h"
>>>  #include <asm/bootparam_utils.h>
>>> +#include <asm/unaccepted_memory.h>
>>>  
>>>  /*
>>>   * WARNING!!
>>> @@ -43,6 +44,9 @@
>>>  void *memmove(void *dest, const void *src, size_t n);
>>>  #endif
>>>  
>>> +#undef __pa
>>> +#define __pa(x)	((unsigned long)(x))
>>
>> Those #undef's always worry me.  Why is this one needed?
> 
> arch/x86/boot/compressed/misc.c:47:9: warning: '__pa' macro redefined [-Wmacro-redefined]
> #define __pa(x) ((unsigned long)(x))
>         ^
> arch/x86/include/asm/page.h:47:9: note: previous definition is here
> #define __pa(x)         __phys_addr((unsigned long)(x))
> 
> Note that sev.c does the same. At least we are consistent :)

Ugh.  Please do look into fixing this properly.  The SEV folks will
thank you. :)

>>> +void accept_memory(phys_addr_t start, phys_addr_t end)
>>> +{
>>> +	unsigned long *unaccepted_memory;
>>> +	unsigned int rs, re;
>>> +
>>> +	unaccepted_memory = (unsigned long *)boot_params->unaccepted_memory;
>>> +	rs = start / PMD_SIZE;
>>
>> OK, so start is a physical address, PMD_SIZE is 2^21, and 'rs' is an
>> unsigned int.  That means 'rs' can, at most, represent a physical
>> address at 2^(21+32), or 2^53.  That's cutting it a *bit* close, don't
>> you think?
>>
>> Could we please just give 'rs' and 're' real names and make them
>> 'unsigned long's, please?  It will surely save at least one other person
>> from doing math.  The find_next_bit() functions seem to take ulongs anyway.
> 
> Okay. 'range_start' and 'range_end' are good enough names?

Yep, works for me.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-09 15:54     ` Kirill A. Shutemov
  2022-04-11  6:38       ` Dave Hansen
@ 2022-04-11  8:47       ` David Hildenbrand
  1 sibling, 0 replies; 67+ messages in thread
From: David Hildenbrand @ 2022-04-11  8:47 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

>>>  	if (fpi_flags & FPI_TO_TAIL)
>>>  		to_tail = true;
>>>  	else if (is_shuffle_order(order))
>>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>>>  static inline bool page_expected_state(struct page *page,
>>>  					unsigned long check_flags)
>>>  {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageUnaccepted(page))
>>>  		return false;
>>
>> That probably deserves a comment, and maybe its own if() statement.
> 
> Own if does not work. PageUnaccepted() is encoded in _mapcount.
> 
> What about this:
> 
> 	/*
> 	 * page->_mapcount is expected to be -1.
> 	 *
> 	 * There is an exception for PageUnaccepted(). The page type can be set
> 	 * for pages on free list. Page types are encoded in _mapcount.
> 	 *
> 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> 	 */
> 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> 		return false;
> 
> ?
> 

Please don't. Keep the usage of PG_* details inside page-flags.h


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-11  6:38       ` Dave Hansen
@ 2022-04-11 10:07         ` Mike Rapoport
  2022-04-13 11:40           ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Mike Rapoport @ 2022-04-11 10:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	David Hildenbrand, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> 
> >>>  	if (fpi_flags & FPI_TO_TAIL)
> >>>  		to_tail = true;
> >>>  	else if (is_shuffle_order(order))
> >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> >>>  static inline bool page_expected_state(struct page *page,
> >>>  					unsigned long check_flags)
> >>>  {
> >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> >>> +	    !PageUnaccepted(page))
> >>>  		return false;
> >>
> >> That probably deserves a comment, and maybe its own if() statement.
> > 
> > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > 
> > What about this:
> > 
> > 	/*
> > 	 * page->_mapcount is expected to be -1.
> > 	 *
> > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > 	 * for pages on free list. Page types are encoded in _mapcount.
> > 	 *
> > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > 	 */
> > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))

Maybe I'm missing something, but isn't this true for any PageType?

> > 		return false;
> > 
> > ?
> 
> That's better.  But, aren't the PG_* names usually reserved for real
> page->flags bits?  That naming might be part of my confusion.

We use them for PageType as well, like PG_buddy, PG_offline, PG_table.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-11  6:41       ` Dave Hansen
@ 2022-04-11 15:55         ` Borislav Petkov
  2022-04-11 16:27           ` Dave Hansen
  0 siblings, 1 reply; 67+ messages in thread
From: Borislav Petkov @ 2022-04-11 15:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
> Let's just call out the possible (probable?) need for new ABI here.
> Maybe it will cue folks who care to speak up.

Err, why would you teach the user to go poke at some arbitrary sysfs
nodes when the accepting code can simply issue a printk from time to
time:

  "Guest unaccepted memory progress: XX%. This slows down operations at the moment."

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-11 15:55         ` Borislav Petkov
@ 2022-04-11 16:27           ` Dave Hansen
  2022-04-11 18:55             ` Tom Lendacky
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-11 16:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 4/11/22 08:55, Borislav Petkov wrote:
> On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
>> Let's just call out the possible (probable?) need for new ABI here.
>> Maybe it will cue folks who care to speak up.
> Err, why would you teach the user to go poke at some arbitrary sysfs
> nodes when the accepting code can simply issue a printk from time to
> time
> 
>   "Guest unnaccepted memory progress: XX%. This slows down operations at the moment."

I guess that's not a horrible place to start.  It's also not *horribly*
different from how guests work today.  If hosts lazily allocate RAM,
they'll see largely the same kind of behavior.

What ends up determining how much memory is pre-accepted versus being
done from the guest?  Is that just a normal part of setting up a TDX
guest, like from the qemu cmdline?  Or, is there some convention with
the virtual firmware?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-11 16:27           ` Dave Hansen
@ 2022-04-11 18:55             ` Tom Lendacky
  0 siblings, 0 replies; 67+ messages in thread
From: Tom Lendacky @ 2022-04-11 18:55 UTC (permalink / raw)
  To: Dave Hansen, Borislav Petkov
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	Mike Rapoport, David Hildenbrand, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 4/11/22 11:27, Dave Hansen wrote:
> On 4/11/22 08:55, Borislav Petkov wrote:
>> On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
>>> Let's just call out the possible (probable?) need for new ABI here.
>>> Maybe it will cue folks who care to speak up.
>> Err, why would you teach the user to go poke at some arbitrary sysfs
>> nodes when the accepting code can simply issue a printk from time to
>> time
>>
>>    "Guest unnaccepted memory progress: XX%. This slows down operations at the moment."
> 
> I guess that's not a horrible place to start.  It's also not *horribly*
> different from how guests work today.  If hosts lazily allocate RAM,
> they'll see largely the same kind of behavior.
> 
> What ends up determining how much memory is pre-accepted versus being
> done from the guest?  Is that just a normal part of setting up a TDX
> guest, like from the qemu cmdline?  Or, is there some convention with
> the virtual firmware?

With SNP, some memory will be accepted as part of the LAUNCH_UPDATE 
sequences that the hypervisor performs, but that is not all of the guest 
memory. Once the guest is started, the (initial implementation of) OVMF 
SNP support will accept (PVALIDATE) all of the remaining guest memory. 
When the kernel boots, there isn't any unaccepted memory.

Once support is available in the kernel for unaccepted memory, then OVMF 
could be updated to only accept a limited amount of memory and pass the 
information about the unaccepted memory to the kernel through the EFI 
memory map.

The approaches would have to be measured to see which ends up being the 
best one. The GHCB specification allows for lots of memory to be accepted 
in a single VMGEXIT (world switch) vs performing a VMGEXIT for each 2MB of 
memory being accepted.
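
As a rough illustration of why the batching matters, the number of world
switches can be estimated with simple arithmetic. This is only a sketch:
sketch_exits_needed() and the chunks_per_exit parameter are hypothetical,
and the actual number of entries a single request can carry is defined by
the GHCB specification, not by this example:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CHUNK_SIZE (UINT64_C(2) << 20)	/* 2MiB per acceptance unit */

/*
 * Number of exits needed to accept 'bytes' of memory when one exit can
 * carry up to 'chunks_per_exit' 2MiB chunks. chunks_per_exit == 1 models
 * "one exit per 2MB"; larger values model batched requests.
 */
uint64_t sketch_exits_needed(uint64_t bytes, uint64_t chunks_per_exit)
{
	uint64_t chunks;

	assert(chunks_per_exit > 0);
	/* Round up at both levels without risking overflow. */
	chunks = bytes / SKETCH_CHUNK_SIZE + (bytes % SKETCH_CHUNK_SIZE != 0);
	return chunks / chunks_per_exit + (chunks % chunks_per_exit != 0);
}
```

For a 64GiB guest that is 32768 exits at one chunk per exit; any batch size
larger than one divides that count accordingly. Whether the saved exits
dominate the total cost would still have to be measured.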

Thanks,
Tom

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-08 19:11   ` Dave Hansen
  2022-04-09 17:52     ` Kirill A. Shutemov
@ 2022-04-12  8:15     ` David Hildenbrand
  2022-04-12 16:08       ` Dave Hansen
  1 sibling, 1 reply; 67+ messages in thread
From: David Hildenbrand @ 2022-04-12  8:15 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 08.04.22 21:11, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
>> Kernel only needs to accept memory once after boot, so during the boot
>> and warm up phase there will be a lot of memory acceptance. After things
>> are settled down the only price of the feature if couple of checks for
>> PageUnaccepted() in allocate and free paths. The check refers a hot
>> variable (that also encodes PageBuddy()), so it is cheap and not visible
>> on profiles.
> 
> Let's also not sugar-coat this.  Page acceptance is hideously slow.
> It's agonizingly slow.  To boot, it's done holding a global spinlock
> with interrupts disabled (see patch 6/8).  At the very, very least, each
> acceptance operation involves a couple of what are effectively ring
> transitions, a 2MB memset(), and a bunch of cache flushing.
> 
> The system is going to be downright unusable during this time, right?
> 
> Sure, it's *temporary* and only happens once at boot.  But, it's going
> to suck.
> 
> Am I over-stating this in any way?
> 
> The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
>  But, I think we should also write down some guidance like:
> 
> 	If your TDX system seems as slow as snail after boot, look at
> 	the "accept_memory" counter in /proc/vmstat.  If it is
> 	incrementing, then TDX memory acceptance is likely to blame.
> 
> Do we need anything more discrete to tell users when acceptance is over?
>  For instance, maybe they run something and it goes really slow, they
> watch "accept_memory" until it stops.  They rejoice at their good
> fortune!  Then, memory allocation starts falling over to a new node and
> the agony beings anew.
> 
> I can think of dealing with this in two ways:
> 
> 	cat /sys/.../unaccepted_pages_left
> 
> which just walks the bitmap and counts the amount of pages remaining. or
> something like:
> 
> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
> 
> Which will, well, make the pain stop on node0.
> 

Either I'm missing something important or the random pain might just
take a really long time to stop?

I mean, we tend to reallocate the memory first that we freed last
(putting it to the head of the freelist when freeing and picking from
the head when allocating).

So unless your kernel goes crazy and allocates each and every page right
after boot, essentially accepting all memory, you might have random
unaccepted pages lurking at the tail of the freelists.

So if the VM is running for 355 days without significant memory
pressure, you can still run into unaccepted pages on day 356, resulting
in a random delay due to acceptance of memory.
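
The effect can be shown with a toy model. This is a user-space sketch under
simplifying assumptions (a single LIFO list, no migratetypes, no per-CPU
lists, no buddy merging); all sketch_* names are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NPAGES 8

struct sketch_page {
	bool accepted;
	struct sketch_page *next;
};

struct sketch_freelist {
	struct sketch_page *head;
};

/* Free to the head, so the most recently freed page is reused first. */
void sketch_free(struct sketch_freelist *fl, struct sketch_page *page)
{
	page->next = fl->head;
	fl->head = page;
}

/* Allocate from the head; accept lazily on first use. */
struct sketch_page *sketch_alloc(struct sketch_freelist *fl)
{
	struct sketch_page *page = fl->head;

	if (!page)
		return NULL;
	fl->head = page->next;
	page->next = NULL;
	page->accepted = true;	/* stands in for the slow accept operation */
	return page;
}

/*
 * Start with SKETCH_NPAGES unaccepted pages, run 'rounds' cycles of
 * allocating and freeing one page, and count what is still unaccepted.
 */
unsigned long sketch_unaccepted_after(unsigned long rounds)
{
	struct sketch_page pages[SKETCH_NPAGES] = { 0 };
	struct sketch_freelist fl = { 0 };
	unsigned long i, left = 0;

	for (i = 0; i < SKETCH_NPAGES; i++)
		sketch_free(&fl, &pages[i]);
	for (i = 0; i < rounds; i++) {
		struct sketch_page *page = sketch_alloc(&fl);

		assert(page != NULL);
		sketch_free(&fl, page);
	}
	for (i = 0; i < SKETCH_NPAGES; i++)
		left += !pages[i].accepted;
	return left;
}
```

However many rounds are run, a workload that never holds more than one page
at a time only ever touches the page at the head, so the other seven stay
unaccepted until something finally digs deeper into the list.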


I think we most certainly want some way to make the random pain stop, or
to make the random pain go away after boot quickly. The
"unaccepted_pages_left" indicator would just be a "hey, there might be
random delays, but you cannot do anything about it". Magic toggles like
"make_the_pain_stop" are not so nice.

Can we simply automate this using a kthread or something like that, which
just traverses the free page lists and accepts pages (similar to, but
different from, free page reporting)?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 8/8] mm/vmstat: Add counter for memory accepting
  2022-04-05 23:43 ` [PATCHv4 8/8] mm/vmstat: Add counter for memory accepting Kirill A. Shutemov
@ 2022-04-12  8:18   ` David Hildenbrand
  0 siblings, 0 replies; 67+ messages in thread
From: David Hildenbrand @ 2022-04-12  8:18 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Dave Hansen, Brijesh Singh, Mike Rapoport, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 06.04.22 01:43, Kirill A. Shutemov wrote:
> The counter increased every time kernel accepts a memory region.
> 
> The counter allows to see if memory acceptation is still ongoing and
> contributes to memory allocation latency.
> 

Does it? See my other mail, can't we have the counter essentially not
changing for a long time but still some unaccepted pages sitting at the
tail of the freelists that will only get allocated+accepted under real
memory pressure?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-12  8:15     ` David Hildenbrand
@ 2022-04-12 16:08       ` Dave Hansen
  2022-04-13 10:36         ` David Hildenbrand
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Hansen @ 2022-04-12 16:08 UTC (permalink / raw)
  To: David Hildenbrand, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 4/12/22 01:15, David Hildenbrand wrote:
> Can we simply automate this using a kthread or smth like that, which
> just traverses the free page lists and accepts pages (similar, but
> different to free page reporting)?

That's definitely doable.

The downside is that this will force premature consumption of physical
memory resources that the guest may never use.  That's a particular
problem on TDX systems since there is no way for a VMM to reclaim guest
memory short of killing the guest.

In other words, I can see a good argument either way:
1. The kernel should accept everything to avoid the perf nastiness
2. The kernel should accept only what it needs in order to reduce memory
   use

I'm kinda partial to #1 though, if I had to pick only one.

The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
 Have the rule that everything that gets a 'struct page' must be
accepted.  If you want to do delayed acceptance, you do it via
DEFERRED_STRUCT_PAGE_INIT.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 2/8] efi/x86: Get full memory map in allocate_e820()
  2022-04-05 23:43 ` [PATCHv4 2/8] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
@ 2022-04-13  9:59   ` Borislav Petkov
  2022-04-13 11:45     ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Borislav Petkov @ 2022-04-13  9:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Apr 06, 2022 at 02:43:37AM +0300, Kirill A. Shutemov wrote:
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index 01ddd4502e28..d18cac8ab436 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -569,30 +569,28 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
>  }
>  
>  static efi_status_t allocate_e820(struct boot_params *params,
> +				  struct efi_boot_memmap *map,
>  				  struct setup_data **e820ext,
>  				  u32 *e820ext_size)
>  {
> -	unsigned long map_size, desc_size, map_key;
>  	efi_status_t status;
> -	__u32 nr_desc, desc_version;
> +	__u32 nr_desc;
>  
> -	/* Only need the size of the mem map and size of each mem descriptor */
> -	map_size = 0;
> -	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
> -			     &desc_size, &desc_version);
> -	if (status != EFI_BUFFER_TOO_SMALL)
> -		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
> -
> -	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
> +	status = efi_get_memory_map(map);
> +	if (status != EFI_SUCCESS)
> +		return status;
>  
> -	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
> -		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
> +	nr_desc = *map->map_size / *map->desc_size;
> +	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
> +		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
> +			EFI_MMAP_NR_SLACK_SLOTS;
>  
>  		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
>  		if (status != EFI_SUCCESS)
> -			return status;
> +			goto out;

This looks weird. With the goto out of the way, this code turns into:

		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
		if (status != EFI_SUCCESS) {
			efi_bs_call(free_pool, *map->map);
			return EFI_SUCCESS;
		}

I think you want to return status as above after having called

	efi_bs_call(free_pool, *map->map);

...
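
To spell out the pattern being asked for, here is a minimal user-space
sketch. The sketch_* functions and status values are hypothetical stand-ins
for efi_get_memory_map(), alloc_e820ext() and free_pool; the assumption
that the map is freed on the success path as well is mine:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for efi_status_t and its values. */
typedef int sketch_status_t;
#define SKETCH_SUCCESS			0
#define SKETCH_OUT_OF_RESOURCES		1

/* Counts outstanding map buffers so leaks are easy to spot. */
int sketch_live_buffers;

sketch_status_t sketch_get_map(void **map)
{
	*map = malloc(64);
	if (!*map)
		return SKETCH_OUT_OF_RESOURCES;
	sketch_live_buffers++;
	return SKETCH_SUCCESS;
}

void sketch_free_map(void *map)
{
	assert(sketch_live_buffers > 0);
	free(map);
	sketch_live_buffers--;
}

/* Stand-in for alloc_e820ext(); 'fail' forces the error path. */
sketch_status_t sketch_alloc_ext(int fail)
{
	return fail ? SKETCH_OUT_OF_RESOURCES : SKETCH_SUCCESS;
}

sketch_status_t sketch_allocate(int fail_ext)
{
	sketch_status_t status;
	void *map;

	status = sketch_get_map(&map);
	if (status != SKETCH_SUCCESS)
		return status;

	status = sketch_alloc_ext(fail_ext);

	/* Free the map on every path, then return the saved status. */
	sketch_free_map(map);
	return status;
}
```

The point is only that the cleanup and the return value are independent:
the buffer is released either way, and the caller still sees the failure.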

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-12 16:08       ` Dave Hansen
@ 2022-04-13 10:36         ` David Hildenbrand
  2022-04-13 11:30           ` Kirill A. Shutemov
  2022-04-13 14:39           ` Mike Rapoport
  0 siblings, 2 replies; 67+ messages in thread
From: David Hildenbrand @ 2022-04-13 10:36 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 12.04.22 18:08, Dave Hansen wrote:
> On 4/12/22 01:15, David Hildenbrand wrote:
>> Can we simply automate this using a kthread or smth like that, which
>> just traverses the free page lists and accepts pages (similar, but
>> different to free page reporting)?
> 
> That's definitely doable.
> 
> The downside is that this will force premature consumption of physical
> memory resources that the guest may never use.  That's a particular
> problem on TDX systems since there is no way for a VMM to reclaim guest
> memory short of killing the guest.

IIRC, the hypervisor will usually effectively populate all guest RAM
either way right now. So yes, for hypervisors that might optimize for
that, that statement would be true. But I lost track of how helpful it
would be in the near future e.g., with the fd-based private guest memory
-- maybe they already optimize for delayed acceptance of memory, turning
it into delayed population.

> 
> In other words, I can see a good argument either way:
> 1. The kernel should accept everything to avoid the perf nastiness
> 2. The kernel should accept only what it needs in order to reduce memory
>    use
> 
> I'm kinda partial to #1 though, if I had to pick only one.
> 
> The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
>  Have the rule that everything that gets a 'struct page' must be
> accepted.  If you want to do delayed acceptance, you do it via
> DEFERRED_STRUCT_PAGE_INIT.

That could also be an option, yes. At least being able to choose would be
good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
stuck during boot and wait until everything was accepted.

I see the following variants:

1) Slow boot; after boot, all memory is already accepted.
2) Fast boot; after boot, all memory will slowly but steadily get
   accepted in the background. After a while, all memory is accepted and
   can be signaled to user space.
3) Fast boot; after boot, memory gets accepted on demand. This is what
   we have in this series.

I somehow don't quite like 3), but with deferred population in the
hypervisor, it might just make sense.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 10:36         ` David Hildenbrand
@ 2022-04-13 11:30           ` Kirill A. Shutemov
  2022-04-13 11:32             ` David Hildenbrand
  2022-04-13 15:36             ` Dave Hansen
  2022-04-13 14:39           ` Mike Rapoport
  1 sibling, 2 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-13 11:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	Mike Rapoport, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
> On 12.04.22 18:08, Dave Hansen wrote:
> > On 4/12/22 01:15, David Hildenbrand wrote:
> >> Can we simply automate this using a kthread or smth like that, which
> >> just traverses the free page lists and accepts pages (similar, but
> >> different to free page reporting)?
> > 
> > That's definitely doable.
> > 
> > The downside is that this will force premature consumption of physical
> > memory resources that the guest may never use.  That's a particular
> > problem on TDX systems since there is no way for a VMM to reclaim guest
> > memory short of killing the guest.
> 
> IIRC, the hypervisor will usually effectively populate all guest RAM
> either way right now.

No, it is not usual. By default QEMU/KVM uses an anonymous mapping and
faults in memory on demand.

Yes, there's an option to pre-populate guest memory, but it is not the
default.

> So yes, for hypervisors that might optimize for
> that, that statement would be true. But I lost track how helpful it
> would be in the near future e.g., with the fd-based private guest memory
> -- maybe they already optimize for delayed acceptance of memory, turning
> it into delayed population.
> 
> > 
> > In other words, I can see a good argument either way:
> > 1. The kernel should accept everything to avoid the perf nastiness
> > 2. The kernel should accept only what it needs in order to reduce memory
> >    use
> > 
> > I'm kinda partial to #1 though, if I had to pick only one.
> > 
> > The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> >  Have the rule that everything that gets a 'struct page' must be
> > accepted.  If you want to do delayed acceptance, you do it via
> > DEFERRED_STRUCT_PAGE_INIT.
> 
> That could also be an option, yes. At least being able to chose would be
> good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
> stuck during boot and wait until everything was accepted.

Right. Deferred page init has to be done before init.

> I see the following variants:
> 
> 1) Slow boot; after boot, all memory is already accepted.
> 2) Fast boot; after boot, all memory will slowly but steadily get
>    accepted in the background. After a while, all memory is accepted and
>    can be signaled to user space.
> 3) Fast boot; after boot, memory gets accepted on demand. This is what
>    we have in this series.
> 
> I somehow don't quite like 3), but with deferred population in the
> hypervisor, it might just make sense.

Conceptually, 3 is not different from what happens now. The first time
a normal VM touches a page (like on handling __GFP_ZERO), the page gets
allocated on the host. It can take a very long time if it kicks in direct
reclaim on the host.

The only difference is that it is *usually* slower.

I guess we can make a case for making 1 an option to match the
pre-populated use case for normal VMs.

Frankly, I think option 2 is the worst one. You still take CPU cycles from
the workload after boot to do a job that may or may not be needed. It is a
half-measure that helps nobody.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 11:30           ` Kirill A. Shutemov
@ 2022-04-13 11:32             ` David Hildenbrand
  2022-04-13 15:36             ` Dave Hansen
  1 sibling, 0 replies; 67+ messages in thread
From: David Hildenbrand @ 2022-04-13 11:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	Mike Rapoport, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On 13.04.22 13:30, Kirill A. Shutemov wrote:
> On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
>> On 12.04.22 18:08, Dave Hansen wrote:
>>> On 4/12/22 01:15, David Hildenbrand wrote:
>>>> Can we simply automate this using a kthread or smth like that, which
>>>> just traverses the free page lists and accepts pages (similar, but
>>>> different to free page reporting)?
>>>
>>> That's definitely doable.
>>>
>>> The downside is that this will force premature consumption of physical
>>> memory resources that the guest may never use.  That's a particular
>>> problem on TDX systems since there is no way for a VMM to reclaim guest
>>> memory short of killing the guest.
>>
>> IIRC, the hypervisor will usually effectively populate all guest RAM
>> either way right now.
> 
> No, it is not usual. By default QEMU/KVM uses anonymous mapping and
> fault-in memory on demand.
> 
> Yes, there's an option to pre-populate guest memory, but it is not the
> default.

Let me be clearer: I'm talking about the TDX/SEV world, not ordinary
unencrypted VMs. For ordinary unencrypted VMs we do have populate on
demand frequently.

For SEV we currently pin all guest memory and consequently don't have
populate on demand. For TDX, again, I did not follow how fd-based
private guest memory will behave. I thought I remembered that we will
similarly not have populate-on-demand.

Preallocation is usually used with huge pages, but I guess that's out of
scope right now for encrypted VMs.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-11 10:07         ` Mike Rapoport
@ 2022-04-13 11:40           ` Kirill A. Shutemov
  2022-04-13 14:48             ` Mike Rapoport
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-13 11:40 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	David Hildenbrand, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > 
> > >>>  	if (fpi_flags & FPI_TO_TAIL)
> > >>>  		to_tail = true;
> > >>>  	else if (is_shuffle_order(order))
> > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > >>>  static inline bool page_expected_state(struct page *page,
> > >>>  					unsigned long check_flags)
> > >>>  {
> > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > >>> +	    !PageUnaccepted(page))
> > >>>  		return false;
> > >>
> > >> That probably deserves a comment, and maybe its own if() statement.
> > > 
> > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > 
> > > What about this:
> > > 
> > > 	/*
> > > 	 * page->_mapcount is expected to be -1.
> > > 	 *
> > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > 	 *
> > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > 	 */
> > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> 
> Maybe I'm missing something, but isn't this true for any PageType?
> 
> > > 		return false;
> > > 
> > > ?
> > 
> > That's better.  But, aren't the PG_* names usually reserved for real
> > page->flags bits?  That naming might be part of my confusion.
> 
> We use them for PageType as well like PG_buddy, PG_offline, PG_Table.

PG_buddy gets cleared on removal from the free list, before the check.

PG_offline and PG_table pages are never on free lists.
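
For readers not familiar with the encoding, here is a small user-space
sketch of how the proposed check behaves. The bit values are hypothetical
(the real ones live in page-flags.h), the sketch_* helpers are invented,
and the model ignores the upper base bits that the kernel's page type
checks also look at:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define SKETCH_PG_BUDDY		0x080
#define SKETCH_PG_UNACCEPTED	0x800

/* Setting a page type clears its bit in the all-ones (-1) value. */
int32_t sketch_set_type(int32_t mapcount, int32_t type)
{
	assert(type > 0 && (type & (type - 1)) == 0);	/* single bit */
	return mapcount & ~type;
}

/* Clearing the type sets the bit again. */
int32_t sketch_clear_type(int32_t mapcount, int32_t type)
{
	assert(type > 0 && (type & (type - 1)) == 0);	/* single bit */
	return mapcount | type;
}

/*
 * The check proposed in the thread: the state is as expected if
 * _mapcount is -1 once the "unaccepted" bit is ignored.
 */
bool sketch_expected_state(int32_t mapcount)
{
	return (mapcount | SKETCH_PG_UNACCEPTED) == -1;
}
```

In this model a page that still carries the buddy type fails the check,
which is fine as long as that type is cleared before the check runs; only
the unaccepted type is allowed to survive until post_alloc_hook().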

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHv4 2/8] efi/x86: Get full memory map in allocate_e820()
  2022-04-13  9:59   ` Borislav Petkov
@ 2022-04-13 11:45     ` Kirill A. Shutemov
  0 siblings, 0 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-13 11:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Apr 13, 2022 at 11:59:21AM +0200, Borislav Petkov wrote:
> On Wed, Apr 06, 2022 at 02:43:37AM +0300, Kirill A. Shutemov wrote:
> > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> > index 01ddd4502e28..d18cac8ab436 100644
> > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > @@ -569,30 +569,28 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
> >  }
> >  
> >  static efi_status_t allocate_e820(struct boot_params *params,
> > +				  struct efi_boot_memmap *map,
> >  				  struct setup_data **e820ext,
> >  				  u32 *e820ext_size)
> >  {
> > -	unsigned long map_size, desc_size, map_key;
> >  	efi_status_t status;
> > -	__u32 nr_desc, desc_version;
> > +	__u32 nr_desc;
> >  
> > -	/* Only need the size of the mem map and size of each mem descriptor */
> > -	map_size = 0;
> > -	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
> > -			     &desc_size, &desc_version);
> > -	if (status != EFI_BUFFER_TOO_SMALL)
> > -		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
> > -
> > -	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
> > +	status = efi_get_memory_map(map);
> > +	if (status != EFI_SUCCESS)
> > +		return status;
> >  
> > -	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
> > -		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
> > +	nr_desc = *map->map_size / *map->desc_size;
> > +	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
> > +		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
> > +			EFI_MMAP_NR_SLACK_SLOTS;
> >  
> >  		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
> >  		if (status != EFI_SUCCESS)
> > -			return status;
> > +			goto out;
> 
> This looks weird. With the goto out of the way, this code turns into:
> 
>   		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
> 		if (status != EFI_SUCCESS) {
> 			efi_bs_call(free_pool, *map->map);
> 			return EFI_SUCCESS;
> 		}
> 
> I think you want to return status as above after having called
> 
> 	efi_bs_call(free_pool, *map->map);
> 
> ...

Ah. Right. I actually fix this in the next patch.

Will move it here.
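
For reference, the control flow Borislav asks for can be sketched outside the
kernel. This is only a minimal model under stated assumptions: allocate_table(),
ext_alloc_ok() and ext_alloc_fail() are hypothetical stand-ins (for
allocate_e820() and alloc_e820ext()), and malloc()/free() stand in for
efi_get_memory_map() and efi_bs_call(free_pool, ...). The point is that the
single exit path frees the map and still returns the status of the failing step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum status { STATUS_SUCCESS = 0, STATUS_OUT_OF_RESOURCES };

/* Hypothetical stand-in for alloc_e820ext(): returns a status code. */
typedef enum status (*alloc_ext_fn)(size_t nr_entries, void **ext);

/* Hypothetical allocator that succeeds. */
static enum status ext_alloc_ok(size_t nr_entries, void **ext)
{
	*ext = malloc(nr_entries * 20);
	return *ext ? STATUS_SUCCESS : STATUS_OUT_OF_RESOURCES;
}

/* Hypothetical allocator that always fails. */
static enum status ext_alloc_fail(size_t nr_entries, void **ext)
{
	(void)nr_entries;
	*ext = NULL;
	return STATUS_OUT_OF_RESOURCES;
}

/*
 * The map buffer is acquired first. Every exit path after that frees it,
 * and the status of the failing step is what the caller sees.
 */
static enum status allocate_table(size_t nr_desc, size_t table_slots,
				  alloc_ext_fn alloc_ext, void **ext)
{
	enum status status = STATUS_SUCCESS;
	void *map;

	assert(alloc_ext != NULL && ext != NULL);

	map = malloc(nr_desc ? nr_desc : 1);	/* models efi_get_memory_map() */
	if (!map)
		return STATUS_OUT_OF_RESOURCES;

	if (nr_desc > table_slots) {
		status = alloc_ext(nr_desc - table_slots, ext);
		if (status != STATUS_SUCCESS)
			goto out;	/* status carries the error */
	}

out:
	free(map);		/* models efi_bs_call(free_pool, *map->map) */
	return status;		/* not a hard-coded success value */
}
```

With a failing extension allocator the function returns
STATUS_OUT_OF_RESOURCES rather than success, which is the behaviour the review
comment is after.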

-- 
 Kirill A. Shutemov

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 10:36         ` David Hildenbrand
  2022-04-13 11:30           ` Kirill A. Shutemov
@ 2022-04-13 14:39           ` Mike Rapoport
  1 sibling, 0 replies; 67+ messages in thread
From: Mike Rapoport @ 2022-04-13 14:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
> On 12.04.22 18:08, Dave Hansen wrote:
> > On 4/12/22 01:15, David Hildenbrand wrote:
> > 
> > The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> >  Have the rule that everything that gets a 'struct page' must be
> > accepted.  If you want to do delayed acceptance, you do it via
> > DEFERRED_STRUCT_PAGE_INIT.
> 
> That could also be an option, yes. At least being able to choose would be
> good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
> stuck during boot and wait until everything was accepted.

Deferred page init runs multithreaded, so an SMP guest will be stuck
for less time.
 
> I see the following variants:
> 
> 1) Slow boot; after boot, all memory is already accepted.
> 2) Fast boot; after boot, all memory will slowly but steadily get
>    accepted in the background. After a while, all memory is accepted and
>    can be signaled to user space.
> 3) Fast boot; after boot, memory gets accepted on demand. This is what
>    we have in this series.
> 
> I somehow don't quite like 3), but with deferred population in the
> hypervisor, it might just make sense.

IMHO, deferred population in the hypervisor will be way more complex than this
series, with similar "visible" performance.

-- 
Sincerely yours,
Mike.

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 11:40           ` Kirill A. Shutemov
@ 2022-04-13 14:48             ` Mike Rapoport
  2022-04-13 15:15               ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Mike Rapoport @ 2022-04-13 14:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	David Hildenbrand, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > > 
> > > >>>  	if (fpi_flags & FPI_TO_TAIL)
> > > >>>  		to_tail = true;
> > > >>>  	else if (is_shuffle_order(order))
> > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > >>>  static inline bool page_expected_state(struct page *page,
> > > >>>  					unsigned long check_flags)
> > > >>>  {
> > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > >>> +	    !PageUnaccepted(page))
> > > >>>  		return false;
> > > >>
> > > >> That probably deserves a comment, and maybe its own if() statement.
> > > > 
> > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > > 
> > > > What about this:
> > > > 
> > > > 	/*
> > > > 	 * page->_mapcount is expected to be -1.
> > > > 	 *
> > > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > > 	 *
> > > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > 	 */
> > > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> > 
> > Maybe I'm missing something, but isn't this true for any PageType?
> 
> PG_buddy gets cleared on removal from the free list, before the check.
> 
> PG_offline and PG_table pages are never on free lists.

Right, this will work 'cause PageType is inverted. I still think this
condition is hard to parse and I liked the old variant with
!PageUnaccepted() better.

Maybe if we wrap the whole construct in a helper it will be less eye
hurting.
 
> -- 
>  Kirill A. Shutemov

-- 
Sincerely yours,
Mike.

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 14:48             ` Mike Rapoport
@ 2022-04-13 15:15               ` Kirill A. Shutemov
  2022-04-13 20:06                 ` Mike Rapoport
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-13 15:15 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	David Hildenbrand, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On Wed, Apr 13, 2022 at 05:48:09PM +0300, Mike Rapoport wrote:
> On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > > > 
> > > > >>>  	if (fpi_flags & FPI_TO_TAIL)
> > > > >>>  		to_tail = true;
> > > > >>>  	else if (is_shuffle_order(order))
> > > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > > >>>  static inline bool page_expected_state(struct page *page,
> > > > >>>  					unsigned long check_flags)
> > > > >>>  {
> > > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > > >>> +	    !PageUnaccepted(page))
> > > > >>>  		return false;
> > > > >>
> > > > >> That probably deserves a comment, and maybe its own if() statement.
> > > > > 
> > > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > > > 
> > > > > What about this:
> > > > > 
> > > > > 	/*
> > > > > 	 * page->_mapcount is expected to be -1.
> > > > > 	 *
> > > > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > > > 	 *
> > > > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > > 	 */
> > > > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> > > 
> > > Maybe I'm missing something, but isn't this true for any PageType?
> > 
> > PG_buddy gets cleared on removal from the free list, before the check.
> > 
> > PG_offline and PG_table pages are never on free lists.
> 
> Right, this will work 'cause PageType is inverted. I still think this
> condition is hard to parse and I liked the old variant with
> !PageUnaccepted() better.

Well, the old way of dealing with PageUnaccepted() had a flaw: if the page is
PageUnaccepted(), any other page type is allowed to pass here as well. For
example, PG_unaccepted + PG_buddy would slip through.
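
A small userspace model may make the difference between the two checks easier
to see. It is a sketch only: the constants mimic the inverted page-type
encoding from include/linux/page-flags.h (a type is "set" when its bit is
cleared in the word shared with _mapcount), the value chosen for
PG_UNACCEPTED is an assumption, and old_check_ok()/new_check_ok() are
hypothetical names for the two variants of the condition discussed in this
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; PG_UNACCEPTED's bit is an assumption. */
#define PAGE_TYPE_BASE	0xf0000000u
#define PG_BUDDY	0x00000080u
#define PG_UNACCEPTED	0x00000800u

/* No type set and not mapped: all bits set, i.e. _mapcount == -1. */
#define NO_TYPE		UINT32_MAX

/* Inverted encoding: the type is set when its bit is CLEARED. */
static bool page_has_type(uint32_t page_type, uint32_t flag)
{
	assert((flag & PAGE_TYPE_BASE) == 0);
	return (page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE;
}

/* Old variant: "_mapcount != -1 && !PageUnaccepted(page)" means bad. */
static bool old_check_ok(uint32_t page_type)
{
	return page_type == NO_TYPE || page_has_type(page_type, PG_UNACCEPTED);
}

/* New variant: "(_mapcount | PG_unaccepted) != -1" means bad. */
static bool new_check_ok(uint32_t page_type)
{
	return (page_type | PG_UNACCEPTED) == NO_TYPE;
}
```

In this model a page with only PG_UNACCEPTED set passes both checks, a page
with PG_UNACCEPTED and PG_BUDDY set passes the old check but fails the new
one, and a page with only PG_BUDDY set fails both.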

> Maybe if we wrap the whole construct in a helper it will be less eye
> hurting.

Hm. Any suggestion for what such a helper could look like? I cannot think of
anything sane.

-- 
 Kirill A. Shutemov

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 11:30           ` Kirill A. Shutemov
  2022-04-13 11:32             ` David Hildenbrand
@ 2022-04-13 15:36             ` Dave Hansen
  2022-04-13 16:07               ` David Hildenbrand
  2022-04-13 16:24               ` Kirill A. Shutemov
  1 sibling, 2 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-13 15:36 UTC (permalink / raw)
  To: Kirill A. Shutemov, David Hildenbrand
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 4/13/22 04:30, Kirill A. Shutemov wrote:
>> 2) Fast boot; after boot, all memory will slowly but steadily get
>>    accepted in the background. After a while, all memory is accepted and
>>    can be signaled to user space.
...
> Frankly, I think option 2 is the worst one. You still steal CPU cycles from the
> workload after boot to do the job that may or may not be needed. It is a
> half-measure that helps nobody.

Let's not be too hyperbolic here.  "Worst" is entirely subjective and it
totally depends on your perspective and what you care about.

There are basically four options:

 * Accept everything in early boot
 * Accept with deferred page free
 * Accept with kthread after boot
 * Accept on demand

and four things that matter:

 * Code complexity
 * Time to a shell prompt
 * CPU/Memory waste
 * Deterministic overhead

Did I miss any?

News flash: none of the options wins on all the things that matter.
We're going to have to pick one (or maybe two).  I'm also not horribly
convinced that there's a problem here worth solving, especially one that
requires surgery in the core of the buddy allocator.

This is essentially making a performance argument: it takes too long to
boot if we go with a simpler solution.  Yet, I haven't seen any data.  I
think we need to go with the simplest approach(es) until there's some
actual data to guide us here.

Here's another way to look at it:

> https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing


* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 15:36             ` Dave Hansen
@ 2022-04-13 16:07               ` David Hildenbrand
  2022-04-13 16:13                 ` Dave Hansen
  2022-04-13 16:24               ` Kirill A. Shutemov
  1 sibling, 1 reply; 67+ messages in thread
From: David Hildenbrand @ 2022-04-13 16:07 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 13.04.22 17:36, Dave Hansen wrote:
> On 4/13/22 04:30, Kirill A. Shutemov wrote:
>>> 2) Fast boot; after boot, all memory will slowly but steadily get
>>>    accepted in the background. After a while, all memory is accepted and
>>>    can be signaled to user space.
> ...
>> Frankly, I think option 2 is the worst one. You still steal CPU cycles from the
>> workload after boot to do the job that may or may not be needed. It is a
>> half-measure that helps nobody.
> 
> Let's not be too hyperbolic here.  "Worst" is entirely subjective and it
> totally depends on your perspective and what you care about.

Right. Some people might want to start their workload as soon as the
pain is really over. Some might want to have a functional system before
that, others might not care.

> 
> There are basically four options:
> 
>  * Accept everything in early boot
>  * Accept with deferred page free
>  * Accept with kthread after boot
>  * Accept on demand
> 
> and four things that matter:
> 
>  * Code complexity
>  * Time to a shell prompt
>  * CPU/Memory waste
>  * Deterministic overhead
> 
> Did I miss any?

Nothing that comes to mind.

> 
> News flash: none of the options wins on all the things that matter.
> We're going to have to pick one (or maybe two).  I'm also not horribly
> convinced that there's a problem here worth solving, especially one that
> requires surgery in the core of the buddy allocator.
> 
> This is essentially making a performance argument: it takes too long to
> boot if we go with a simpler solution.  Yet, I haven't seen any data.  I
> think we need to go with the simplest approach(es) until there's some
> actual data to guide us here.

Simplest meaning: accept everything during early boot and don't touch
core-mm/buddy code, correct?

> 
> Here's another way to look at it:
> 
>> https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing
> 


-- 
Thanks,

David / dhildenb


* Re: [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory
  2022-04-08 19:21   ` Dave Hansen
@ 2022-04-13 16:08     ` Kirill A. Shutemov
  0 siblings, 0 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-13 16:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Fri, Apr 08, 2022 at 12:21:19PM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +	unsigned long *unaccepted_memory;
> > +	unsigned long flags;
> > +	unsigned int rs, re;
> > +
> > +	if (!boot_params.unaccepted_memory)
> > +		return;
> > +
> > +	unaccepted_memory = __va(boot_params.unaccepted_memory);
> > +	rs = start / PMD_SIZE;
> > +
> > +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> > +	for_each_set_bitrange_from(rs, re, unaccepted_memory,
> > +				   DIV_ROUND_UP(end, PMD_SIZE)) {
> > +		/* Platform-specific memory-acceptance call goes here */
> > +		panic("Cannot accept memory");
> > +		bitmap_clear(unaccepted_memory, rs, re - rs);
> > +	}
> > +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +}
> 
> Just to reiterate: this is a global spinlock.  It's disabling
> interrupts.  "Platform-specific memory-acceptance call" will soon become:
> 
> 	tdx_accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
> 
> which is a page-by-page __tdx_module_call():
> 
> > +	for (i = 0; i < (end - start) / PAGE_SIZE; i++) {
> > +		if (__tdx_module_call(TDACCEPTPAGE, start + i * PAGE_SIZE,
> > +				      0, 0, 0, NULL)) {
> > +			error("Cannot accept memory: page accept failed\n");
> > +		}
> > +	}
> 
> Each __tdx_module_call() involves a privilege transition that also
> surely includes things like changing CR3.  It can't be cheap.  It also
> is presumably touching the memory and probably flushing it out of the
> CPU caches.  It's also unbounded:
> 
> 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> 	for (i = 0; i < (end - start) / PAGE_SIZE; i++)
> 		// thousands?  tens-of-thousands of cycles??
> 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> 
> How far apart can end and start be?  It's at *least* 2MB in the page
> allocator, which is on the order of a millisecond.  Are we sure there
> aren't any callers that want to do this at a gigabyte granularity?  That
> would hold the global lock and disable interrupts on the order of a second.

This codepath only gets invoked with orders < MAX_ORDER, i.e. at most 4M on x86-64.
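
To put numbers on that bound, here is a back-of-the-envelope sketch. It
assumes 4KiB base pages and MAX_ORDER == 11 (so the largest buddy block is
order 10); both are assumptions about the x86-64 configuration under
discussion, and max_block_bytes() and accept_calls() are hypothetical helpers,
not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumed: 4KiB base pages */
#define MAX_ORDER	11	/* assumed: largest allocation order is 10 */

/* Size in bytes of the largest block the buddy allocator hands out. */
static uint64_t max_block_bytes(unsigned int max_order, unsigned int page_shift)
{
	assert(max_order >= 1);
	return (uint64_t)1 << (max_order - 1 + page_shift);
}

/* Number of page-by-page accept calls needed for [start, end). */
static uint64_t accept_calls(uint64_t start, uint64_t end,
			     unsigned int page_shift)
{
	assert(end >= start);
	return (end - start) >> page_shift;
}
```

Under these assumptions the largest block is 4MiB, which is 1024 per-page
accept calls with the lock held; the 2MB case Dave mentions is 512 calls, and
a hypothetical 1GiB range would be 262144.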

> Do we want to bound the time that the lock can be held?  Or, should we
> just let the lockup detectors tell us that we're being naughty?

The host can always DoS the guest, so yes, this can lead to lockups.


-- 
 Kirill A. Shutemov

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 16:07               ` David Hildenbrand
@ 2022-04-13 16:13                 ` Dave Hansen
  0 siblings, 0 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-13 16:13 UTC (permalink / raw)
  To: David Hildenbrand, Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli,
	Brijesh Singh, Mike Rapoport, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 4/13/22 09:07, David Hildenbrand wrote:
> Simplest meaning: accept everything during early boot and don't touch
> core-mm/buddy code, correct?

Yes, exactly.

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 15:36             ` Dave Hansen
  2022-04-13 16:07               ` David Hildenbrand
@ 2022-04-13 16:24               ` Kirill A. Shutemov
  1 sibling, 0 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-13 16:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Hildenbrand, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	Mike Rapoport, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On Wed, Apr 13, 2022 at 08:36:52AM -0700, Dave Hansen wrote:
> On 4/13/22 04:30, Kirill A. Shutemov wrote:
> >> 2) Fast boot; after boot, all memory will slowly but steadily get
> >>    accepted in the background. After a while, all memory is accepted and
> >>    can be signaled to user space.
> ...
> > Frankly, I think option 2 is the worst one. You still steal CPU cycles from the
> > workload after boot to do the job that may or may not be needed. It is a
> > half-measure that helps nobody.
> 
> Let's not be too hyperbolic here.  "Worst" is entirely subjective and it
> totally depends on your perspective and what you care about.
> 
> There are basically four options:
> 
>  * Accept everything in early boot
>  * Accept with deferred page free
>  * Accept with kthread after boot
>  * Accept on demand
> 
> and four things that matter:
> 
>  * Code complexity
>  * Time to a shell prompt
>  * CPU/Memory waste
>  * Deterministic overhead
> 
> Did I miss any?

"Time to shell" is not equal to "time to do the job". Real workloads do
stuff beyond memory allocations. But, yes, it is harder to quantify.

> News flash: none of the options wins on all the things that matter.
> We're going to have to pick one (or maybe two).  I'm also not horribly
> convinced that there's a problem here worth solving, especially one that
> requires surgery in the core of the buddy allocator.
> 
> This is essentially making a performance argument: it takes too long to
> boot if we go with a simpler solution.  Yet, I haven't seen any data.  I
> think we need to go with the simplest approach(es) until there's some
> actual data to guide us here.
> 
> Here's another way to look at it:
> 
> > https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing

The link is view-only.

AFAICS, the complexity of the kthread approach is on par with or greater than
that of on-demand. You need coordination between the allocator and the thread.
It can be hard for the kthread to hit the right balance between being a CPU hog
and not providing enough accepted memory.

-- 
 Kirill A. Shutemov

* Re: [PATCHv4 1/8] mm: Add support for unaccepted memory
  2022-04-13 15:15               ` Kirill A. Shutemov
@ 2022-04-13 20:06                 ` Mike Rapoport
  0 siblings, 0 replies; 67+ messages in thread
From: Mike Rapoport @ 2022-04-13 20:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	David Hildenbrand, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel, Mike Rapoport

On Wed, Apr 13, 2022 at 06:15:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Apr 13, 2022 at 05:48:09PM +0300, Mike Rapoport wrote:
> > On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > > > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > > > > 
> > > > > >>>  	if (fpi_flags & FPI_TO_TAIL)
> > > > > >>>  		to_tail = true;
> > > > > >>>  	else if (is_shuffle_order(order))
> > > > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > > > >>>  static inline bool page_expected_state(struct page *page,
> > > > > >>>  					unsigned long check_flags)
> > > > > >>>  {
> > > > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > > > >>> +	    !PageUnaccepted(page))
> > > > > >>>  		return false;
> > > > > >>
> > > > > >> That probably deserves a comment, and maybe its own if() statement.
> > > > > > 
> > > > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > > > > 
> > > > > > What about this:
> > > > > > 
> > > > > > 	/*
> > > > > > 	 * page->_mapcount is expected to be -1.
> > > > > > 	 *
> > > > > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > > > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > > > > 	 *
> > > > > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > > > 	 */
> > > > > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> > > > 
> > > > Maybe I'm missing something, but isn't this true for any PageType?
> > > 
> > > PG_buddy gets cleared on removal from the free list, before the check.
> > > 
> > > PG_offline and PG_table pages are never on free lists.
> > 
> > Right, this will work 'cause PageType is inverted. I still think this
> > condition is hard to parse and I liked the old variant with
> > !PageUnaccepted() better.
> 
> Well the old way to deal with PageUnaccepted() had a flaw: if the page is
> PageUnaccepted() it will allow any other page types to pass here. Like
> PG_unaccepted + PG_buddy will slide here.

It seems to me that there was an implicit assumption that page types are
exclusive and PG_unaccepted would break it.
 
> > Maybe if we wrap the whole construct in a helper it will be less eye
> > hurting.
> 
> Hm. Any suggestion how such helper could look like? Cannot think of
> anything sane.

Me neither :(

How about updating the comment to be

	/*
	 * The page must not be mapped to userspace and must not have a
	 * PageType other than PageUnaccepted.
	 * This means that page->_mapcount must be -1 or have only the
	 * PG_unaccepted bit cleared.
	 */
	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
 
> -- 
>  Kirill A. Shutemov

-- 
Sincerely yours,
Mike.

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-08 17:26   ` Dave Hansen
  2022-04-09 19:41     ` Kirill A. Shutemov
@ 2022-04-14 15:55     ` Borislav Petkov
  1 sibling, 0 replies; 67+ messages in thread
From: Borislav Petkov @ 2022-04-14 15:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	Mike Rapoport, David Hildenbrand, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Fri, Apr 08, 2022 at 10:26:14AM -0700, Dave Hansen wrote:
> > +	/*
> > +	 * Accept small regions that might not be able to be represented
> > +	 * in the bitmap:
> > +	 */
> > +	if (end - start < 2 * PMD_SIZE) {
> > +		__accept_memory(start, end);
> > +		return;
> > +	}
> 
> This is not my first time looking at this code and I still had to think
> about this a bit.  That's not good.  That pathological case here is
> actually something like this:
> 
> | 4k | 2044k + 2044k | 4k |
> ^ 0x0 	     ^ 2MB	  ^ 4MB
> 
> Where we have a 2MB-aligned 4k accepted area, a 4088k unaccepted area,
> then another 4k accepted area.  That will not result in any bits being
> set in the unaccepted memory bitmap because no 2MB region is fully unaccepted.

I could use that ascii art very well in a comment above it instead of
having to paint it in my mind each time.
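
The same case can also be worked through in a small userspace model of the
mark_unaccepted() logic from the patch. This is a sketch under assumptions:
struct accept_state, accept_now() and the 64-bit bitmap are hypothetical
simplifications (the real code records into the boot_params bitmap and calls
the platform-specific acceptance routine), and only the rounding logic is
modelled.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE	(2ULL << 20)		/* 2MiB per bitmap bit */
#define PMD_MASK	(~(PMD_SIZE - 1))
#define BITMAP_BITS	64			/* model covers 128MiB only */

struct accept_state {
	uint64_t bitmap;		/* bit set = 2MiB chunk is unaccepted */
	uint64_t accepted_bytes;	/* bytes accepted eagerly */
};

static uint64_t round_up_pmd(uint64_t x)
{
	return (x + PMD_SIZE - 1) & PMD_MASK;
}

static uint64_t round_down_pmd(uint64_t x)
{
	return x & PMD_MASK;
}

/* Models __accept_memory(): just counts what was accepted right away. */
static void accept_now(struct accept_state *s, uint64_t start, uint64_t end)
{
	s->accepted_bytes += end - start;
}

static void mark_unaccepted(struct accept_state *s, uint64_t start, uint64_t end)
{
	assert(start <= end && end <= BITMAP_BITS * PMD_SIZE);

	/* Too small to be guaranteed a full, aligned 2MiB chunk: accept it all. */
	if (end - start < 2 * PMD_SIZE) {
		accept_now(s, start, end);
		return;
	}

	/* Accept the unaligned head and tail, keep the aligned middle. */
	if (start & ~PMD_MASK) {
		accept_now(s, start, round_up_pmd(start));
		start = round_up_pmd(start);
	}
	if (end & ~PMD_MASK) {
		accept_now(s, round_down_pmd(end), end);
		end = round_down_pmd(end);
	}

	for (uint64_t bit = start / PMD_SIZE; bit < end / PMD_SIZE; bit++)
		s->bitmap |= 1ULL << bit;
}
```

For the range from 4k to 4MB minus 4k (the 4088k area in the picture above)
the model accepts all 4088k up front and sets no bits. For a range from 4k to
6MB it accepts the first 2MB minus 4k and sets the bits for the chunks at 2MB
and 4MB.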

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-05 23:43 ` [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
  2022-04-08 17:26   ` Dave Hansen
@ 2022-04-15 22:24   ` Borislav Petkov
  2022-04-18 15:55     ` Kirill A. Shutemov
  1 sibling, 1 reply; 67+ messages in thread
From: Borislav Petkov @ 2022-04-15 22:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Apr 06, 2022 at 02:43:38AM +0300, Kirill A. Shutemov wrote:
> diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
> index f088f5881666..8e3447a4b373 100644
> --- a/Documentation/x86/zero-page.rst
> +++ b/Documentation/x86/zero-page.rst
> @@ -42,4 +42,5 @@ Offset/Size	Proto	Name		Meaning
>  2D0/A00		ALL	e820_table		E820 memory map table
>  						(array of struct e820_entry)
>  D00/1EC		ALL	eddbuf			EDD data (array of struct edd_info)
> +ECC/008		ALL	unaccepted_memory	Bitmap of unaccepted memory (1bit == 2M)

There's a perfectly fine spot at 0x78:

	__u8  _pad3[8];                                 /* 0x078 */

why not take that one?

>  ===========	=====	=======================	=================================================
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 8fd0e6ae2e1f..09993797efa2 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -102,6 +102,7 @@ endif
>  
>  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>  vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/unaccepted_memory.o
>  
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
>  efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
> diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
> new file mode 100644
> index 000000000000..bf58b259380a
> --- /dev/null
> +++ b/arch/x86/boot/compressed/bitmap.c
> @@ -0,0 +1,24 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Taken from lib/string.c */
> +
> +#include <linux/bitmap.h>

verify_include_paths: Warning: Kernel-proper include at arch/x86/boot/compressed/bitmap.c:4 [+#include <linux/bitmap.h>]

Same game as before: put the stuff you need into a separate or a shared
header and avoid the linux/ namespace include.

> +void __bitmap_set(unsigned long *map, unsigned int start, int len)
> +{
> +	unsigned long *p = map + BIT_WORD(start);
> +	const unsigned int size = start + len;
> +	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
> +	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
> +
> +	while (len - bits_to_set >= 0) {
> +		*p |= mask_to_set;
> +		len -= bits_to_set;
> +		bits_to_set = BITS_PER_LONG;
> +		mask_to_set = ~0UL;
> +		p++;
> +	}
> +	if (len) {
> +		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
> +		*p |= mask_to_set;
> +	}
> +}
> diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
> new file mode 100644
> index 000000000000..d363acf59c08
> --- /dev/null
> +++ b/arch/x86/boot/compressed/unaccepted_memory.c

arch/x86/boot/compressed/mem.c

simply. That "unaccepted_memory" everywhere is a mouthful and too specific.

> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "error.h"
> +#include "misc.h"
> +
> +static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	/* Platform-specific memory-acceptance call goes here */
> +	error("Cannot accept memory");
> +}
> +
> +void mark_unaccepted(struct boot_params *params, u64 start, u64 end)

That name is kinda misleading? It is not only marking as unaccepted - it
is also accepting weird 2M misaligned chunks...

> +{
> +	/*
> +	 * The accepted memory bitmap only works at PMD_SIZE granularity.
> +	 * If a request comes in to mark memory as unaccepted which is not
> +	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
> +	 * *marked* as unaccepted.
> +	 */

That comment goes over the function name.

> +	/*
> +	 * Accept small regions that might not be able to be represented
> +	 * in the bitmap:
> +	 */
> +	if (end - start < 2 * PMD_SIZE) {
> +		__accept_memory(start, end);
> +		return;
> +	}
> +
> +	/*
> +	 * No matter how the start and end are aligned, at least one unaccepted
> +	 * PMD_SIZE area will remain.
> +	 */
> +
> +	/* Immediately accept a <PMD_SIZE piece at the start: */

Immediately? As opposed to delayed?

> +	if (start & ~PMD_MASK) {
> +		__accept_memory(start, round_up(start, PMD_SIZE));
> +		start = round_up(start, PMD_SIZE);
> +	}
> +
> +	/* Immediately accept a <PMD_SIZE piece at the end: */
> +	if (end & ~PMD_MASK) {
> +		__accept_memory(round_down(end, PMD_SIZE), end);
> +		end = round_down(end, PMD_SIZE);
> +	}
> +
> +	/*
> +	 * 'start' and 'end' are now both PMD-aligned.
> +	 * Record the range as being unaccepted:
> +	 */
> +	bitmap_set((unsigned long *)params->unaccepted_memory,
> +		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> +}
> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h

Why do you need a separate header?

We already have

arch/x86/include/asm/mem_encrypt.h

and this is kinda very much related...

> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 2c3dac5ecb36..b17ceec757d0 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -243,6 +243,21 @@ config EFI_DISABLE_PCI_DMA
>  	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
>  	  may be used to override this option.
>  
> +config UNACCEPTED_MEMORY
> +	bool
> +	depends on EFI_STUB
> +	depends on !KEXEC_CORE
> +	help
> +	   Some Virtual Machine platforms, such as Intel TDX, require
> +	   some memory to be "accepted" by the guest before it can be used.
> +	   This mechanism helps prevent malicious hosts from making changes
> +	   to guest memory.
> +
> +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> +	   This option adds support for unaccepted memory and makes such memory
> +	   usable by kernel.

... by *the* kernel.

> +
>  endmenu
>  
>  config EFI_EMBEDDED_FIRMWARE
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 5502e176d51b..2c055afb1b11 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -747,6 +747,7 @@ static __initdata char memory_type_name[][13] = {
>  	"MMIO Port",
>  	"PAL Code",
>  	"Persistent",
> +	"Unaccepted",
>  };
>  
>  char * __init efi_md_typeattr_format(char *buf, size_t size,
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index d18cac8ab436..e7601fd612aa 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,12 +9,14 @@
>  #include <linux/efi.h>
>  #include <linux/pci.h>
>  #include <linux/stddef.h>
> +#include <linux/bitmap.h>
>  
>  #include <asm/efi.h>
>  #include <asm/e820/types.h>
>  #include <asm/setup.h>
>  #include <asm/desc.h>
>  #include <asm/boot.h>
> +#include <asm/unaccepted_memory.h>
>  
>  #include "efistub.h"
>  
> @@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>  			e820_type = E820_TYPE_PMEM;
>  			break;
>  
> +		case EFI_UNACCEPTED_MEMORY:
> +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +				continue;
> +			e820_type = E820_TYPE_RAM;
> +			mark_unaccepted(params, d->phys_addr,
> +					d->phys_addr + PAGE_SIZE * d->num_pages);
> +			break;
>  		default:
>  			continue;
>  		}
> @@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
>  {
>  	efi_status_t status;
>  	__u32 nr_desc;
> +	bool unaccepted_memory_present = false;

This wholly written out "unaccepted_memory" everywhere is too much and
too long. How about 

	bool unaccept_mem;

or so?

> +	u64 max_addr = 0;
> +	int i;
>  
>  	status = efi_get_memory_map(map);
>  	if (status != EFI_SUCCESS)
> @@ -589,9 +601,57 @@ static efi_status_t allocate_e820(struct boot_params *params,
>  		if (status != EFI_SUCCESS)
>  			goto out;
>  	}

This whole chunk you're adding here begs to be a separate function with
the big fat comment placed over the function name.

Might just as well call it after allocate_e820() has been called.

> +
> +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +		goto out;
> +
> +	/* Check if there's any unaccepted memory and find the max address */
> +	for (i = 0; i < nr_desc; i++) {
> +		efi_memory_desc_t *d;
> +
> +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> +		if (d->type == EFI_UNACCEPTED_MEMORY)
> +			unaccepted_memory_present = true;
> +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> +	}
> +
> +	/*
> +	 * If unaccepted memory is present allocate a bitmap to track what
> +	 * memory has to be accepted before access.
> +	 *
> +	 * One bit in the bitmap represents 2MiB in the address space:
> +	 * A 4k bitmap can track 64GiB of physical address space.
> +	 *
> +	 * In the worst case scenario -- a huge hole in the middle of the
> +	 * address space -- It needs 256MiB to handle 4PiB of the address
> +	 * space.

And you're saying that that efi_allocate_pages() below can really give a
256M contiguous chunk?

> +	 *
> +	 * TODO: handle situation if params->unaccepted_memory has already set.
> +	 * It's required to deal with kexec.
> +	 *
> +	 * The bitmap will be populated in setup_e820() according to the memory
> +	 * map after efi_exit_boot_services().
> +	 */
> +	if (unaccepted_memory_present) {
> +		unsigned long *unaccepted_memory = NULL;

So if you call this simply

		unsigned long *mem = ...

> +		u64 size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> +
> +		status = efi_allocate_pages(size,
> +					    (unsigned long *)&unaccepted_memory,
> +					    ULONG_MAX);

... you'd have this on a single line:

		status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);

> +		if (status != EFI_SUCCESS)
> +			goto out;
> +		memset(unaccepted_memory, 0, size);
> +		params->unaccepted_memory = (unsigned long)unaccepted_memory;

... and then have this assignment more readable:

		params->unaccepted_memory = (unsigned long)mem;

as it shows the important var being ->unaccepted_memory and mem only a
local helper.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-15 22:24   ` Borislav Petkov
@ 2022-04-18 15:55     ` Kirill A. Shutemov
  2022-04-18 16:38       ` Borislav Petkov
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-18 15:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Sat, Apr 16, 2022 at 12:24:26AM +0200, Borislav Petkov wrote:
> On Wed, Apr 06, 2022 at 02:43:38AM +0300, Kirill A. Shutemov wrote:
> > diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
> > index f088f5881666..8e3447a4b373 100644
> > --- a/Documentation/x86/zero-page.rst
> > +++ b/Documentation/x86/zero-page.rst
> > @@ -42,4 +42,5 @@ Offset/Size	Proto	Name		Meaning
> >  2D0/A00		ALL	e820_table		E820 memory map table
> >  						(array of struct e820_entry)
> >  D00/1EC		ALL	eddbuf			EDD data (array of struct edd_info)
> > +ECC/008		ALL	unaccepted_memory	Bitmap of unaccepted memory (1bit == 2M)
> 
> There's a perfectly fine spot at 0x78:
> 
> 	__u8  _pad3[8];                                 /* 0x078 */
> 
> why not take that one?

Good point. Will do.

> 
> >  ===========	=====	=======================	=================================================
> > diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> > index 8fd0e6ae2e1f..09993797efa2 100644
> > --- a/arch/x86/boot/compressed/Makefile
> > +++ b/arch/x86/boot/compressed/Makefile
> > @@ -102,6 +102,7 @@ endif
> >  
> >  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
> >  vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> > +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/unaccepted_memory.o
> >  
> >  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
> >  efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
> > diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
> > new file mode 100644
> > index 000000000000..bf58b259380a
> > --- /dev/null
> > +++ b/arch/x86/boot/compressed/bitmap.c
> > @@ -0,0 +1,24 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Taken from lib/string.c */
> > +
> > +#include <linux/bitmap.h>
> 
> verify_include_paths: Warning: Kernel-proper include at arch/x86/boot/compressed/bitmap.c:4 [+#include <linux/bitmap.h>]
> 
> Same game as before: put the stuff you need into a separate or a shared
> header and avoid the linux/ namespace include.

I'm confused here. What is wrong with linux/ include namespace?

Yes, we had a story with <asm/io.h> that actually caused an issue in the
decompression code, but linux/ has a lot of perfectly portable
library-like stuff.

Could you explain what the rules are?

> > @@ -0,0 +1,53 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include "error.h"
> > +#include "misc.h"
> > +
> > +static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +	/* Platform-specific memory-acceptance call goes here */
> > +	error("Cannot accept memory");
> > +}
> > +
> > +void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> 
> That name is kinda misleading? It is not only marking as unaccepted - it
> is also accepting weird 2M misaligned chunks...

Hm. accept_or_mark_unaccepted()?

> > +{
> > +	/*
> > +	 * The accepted memory bitmap only works at PMD_SIZE granularity.
> > +	 * If a request comes in to mark memory as unaccepted which is not
> > +	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
> > +	 * *marked* as unaccepted.
> > +	 */
> 
> That comment goes over the function name.
> 
> > +	/*
> > +	 * Accept small regions that might not be able to be represented
> > +	 * in the bitmap:
> > +	 */
> > +	if (end - start < 2 * PMD_SIZE) {
> > +		__accept_memory(start, end);
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * No matter how the start and end are aligned, at least one unaccepted
> > +	 * PMD_SIZE area will remain.
> > +	 */
> > +
> > +	/* Immediately accept a <PMD_SIZE piece at the start: */
> 
> Immediately? As opposed to delayed?

Yes. Otherwise accept is delayed until the first allocation of the memory.

> > +	if (start & ~PMD_MASK) {
> > +		__accept_memory(start, round_up(start, PMD_SIZE));
> > +		start = round_up(start, PMD_SIZE);
> > +	}
> > +
> > +	/* Immediately accept a <PMD_SIZE piece at the end: */
> > +	if (end & ~PMD_MASK) {
> > +		__accept_memory(round_down(end, PMD_SIZE), end);
> > +		end = round_down(end, PMD_SIZE);
> > +	}
> > +
> > +	/*
> > +	 * 'start' and 'end' are now both PMD-aligned.
> > +	 * Record the range as being unaccepted:
> > +	 */
> > +	bitmap_set((unsigned long *)params->unaccepted_memory,
> > +		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> > +}
> > diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> 
> Why do you need a separate header?
> 
> We already have
> 
> arch/x86/include/asm/mem_encrypt.h
> 
> and this is kinda very much related...

I don't see it.

Memory encryption can be a reason to have unaccepted memory, but it is not
a 1:1 match. Unaccepted memory can be present without memory encryption if
data security and integrity are guaranteed by other means.

<asm/mem_encrypt.h> is very AMD SME/SEV centric. I'm not sure it needs to
exist in the way it is now.

> > +	u64 max_addr = 0;
> > +	int i;
> >  
> >  	status = efi_get_memory_map(map);
> >  	if (status != EFI_SUCCESS)
> > @@ -589,9 +601,57 @@ static efi_status_t allocate_e820(struct boot_params *params,
> >  		if (status != EFI_SUCCESS)
> >  			goto out;
> >  	}
> 
> This whole chunk you're adding here begs to be a separate function with
> the big fat comment placed over the function name.
> 
> Might just as well call it after allocate_e820() has been called.

Okay, I will move it into a separate function, but it has to be called
from allocate_e820() because it allocates and frees the map.

> > +
> > +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +		goto out;
> > +
> > +	/* Check if there's any unaccepted memory and find the max address */
> > +	for (i = 0; i < nr_desc; i++) {
> > +		efi_memory_desc_t *d;
> > +
> > +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> > +		if (d->type == EFI_UNACCEPTED_MEMORY)
> > +			unaccepted_memory_present = true;
> > +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> > +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> > +	}
> > +
> > +	/*
> > +	 * If unaccepted memory is present allocate a bitmap to track what
> > +	 * memory has to be accepted before access.
> > +	 *
> > +	 * One bit in the bitmap represents 2MiB in the address space:
> > +	 * A 4k bitmap can track 64GiB of physical address space.
> > +	 *
> > +	 * In the worst case scenario -- a huge hole in the middle of the
> > +	 * address space -- It needs 256MiB to handle 4PiB of the address
> > +	 * space.
> 
> And you're saying that that efi_allocate_pages() below can really give a
> 256M contiguous chunk?

Yes, that's the assumption. Is it too much to ask when dealing with 4PiB
of PA?
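
The sizing numbers in the quoted comment can be checked with a small
userspace sketch. It assumes one bit per 2MiB (PMD_SIZE on x86-64) and a
bitmap that covers everything from address 0 up to the highest address in
the memory map. unaccepted_bitmap_bytes() is a hypothetical helper name
used only for this illustration; it is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE      (UINT64_C(1) << 21)  /* 2MiB tracked by one bit */
#define BITS_PER_BYTE 8

/*
 * Hypothetical helper: bytes of bitmap needed to cover the physical
 * address range [0, max_addr) at one bit per PMD_SIZE.
 */
static uint64_t unaccepted_bitmap_bytes(uint64_t max_addr)
{
	/* One bitmap byte covers 8 * 2MiB = 16MiB of address space. */
	const uint64_t covered_per_byte = PMD_SIZE * BITS_PER_BYTE;

	/* Guard the round-up below against 64-bit overflow. */
	assert(max_addr <= UINT64_MAX - (covered_per_byte - 1));

	return (max_addr + covered_per_byte - 1) / covered_per_byte;
}
```

Under these assumptions 64GiB of address space needs a 4096-byte bitmap
and 4PiB needs 256MiB, which matches the figures in the comment.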

-- 
 Kirill A. Shutemov


* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-18 15:55     ` Kirill A. Shutemov
@ 2022-04-18 16:38       ` Borislav Petkov
  2022-04-18 20:24         ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Borislav Petkov @ 2022-04-18 16:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Mon, Apr 18, 2022 at 06:55:45PM +0300, Kirill A. Shutemov wrote:
> I'm confused here. What is wrong with linux/ include namespace?

The problem is that you need all kinds of workarounds so that the
decompressor builds. Just look at the beginning of

arch/x86/boot/compressed/misc.h

Even you had to do them:

/* cpu_feature_enabled() cannot be used this early */
#define USE_EARLY_PGTABLE_L5

That thing sprinkled everywhere is not a clean solution.

> Yes, we had a story with <asm/io.h> that actually caused an issue in the
> decompression code, but linux/ has a lot of perfectly portable
> library-like stuff.

Yes, those are fine except that not everything that leaks into the
decompressor code through includes is perfectly portable.

> Could you explain what the rules are?

Library-like stuff like types.h, linkage.h, etc we could include for now
but including linux/kernel.h which pulls in everything but the kitchen
sink is bad.

So I'd like for the decompressor to be completely separate from kernel
proper because it is a whole different thing and I want for us to be
able to include headers in it without ugly workarounds just so that
kernel proper include changes do not influence the decompressor.

> Hm. accept_or_mark_unaccepted()?

What's wrong with early_accept_memory()?

> > Immediately? As opposed to delayed?
> 
> Yes. Otherwise accept is delayed until the first allocation of the memory.

Yes, put that in the comment pls.

> Memory encryption can be a reason to have unaccepted memory, but it is not
> a 1:1 match. Unaccepted memory can be present without memory encryption if
> data security and integrity are guaranteed by other means.

Really?

Please elaborate. I thought memory acceptance is a feature solely for
TDX and SNP guests to use.

> <asm/mem_encrypt.h> is very AMD SME/SEV centric.

So?

> I'm not sure it needs to exist in the way it is now.

I'm not sure what your argument actually is for having yet another
separate header vs putting it in a header which already deals with that
stuff.

> Okay, I will move it into a separate function, but it has to be called
> from allocate_e820() because it allocates and frees the map.

You mean, you want for allocate_e820() to call this new function because
both allocate and free?

Might have to explain what you mean here exactly.

> > And you're saying that that efi_allocate_pages() below can really give a
> > 256M contiguous chunk?
> 
> Yes, that's the assumption. Is it too much to ask when dealing with 4PiB
> of PA?

From my experience, asking firmware to do stuff for ya is always a risky
thing. I guess such a huge allocation, when it fails, will be caught
early in platform verification so whatever...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-18 16:38       ` Borislav Petkov
@ 2022-04-18 20:24         ` Kirill A. Shutemov
  2022-04-18 21:01           ` Borislav Petkov
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-18 20:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Mon, Apr 18, 2022 at 06:38:52PM +0200, Borislav Petkov wrote:
> > Could you explain what the rules are?
> 
> Library-like stuff like types.h, linkage.h, etc we could include for now
> but including linux/kernel.h which pulls in everything but the kitchen
> sink is bad.

<linux/bitmap.h> doesn't include <linux/kernel.h> or similar things.
Is it okay for now?


> > Hm. accept_or_mark_unaccepted()?
> 
> What's wrong with early_accept_memory()?

But the goal of the function is not to accept the memory, but mark it
as unaccepted in the bitmap. Your proposal is more confusing, not less.

> > > Immediately? As opposed to delayed?
> > 
> > Yes. Otherwise accept is delayed until the first allocation of the memory.
> 
> Yes, put that in the comment pls.

Okay.

> > Memory encryption can be a reason to have unaccepted memory, but it is not
> > a 1:1 match. Unaccepted memory can be present without memory encryption if
> > data security and integrity are guaranteed by other means.
> 
> Really?
> 
> Please elaborate. I thought memory acceptance is a feature solely for
> TDX and SNP guests to use.

Conceptually, it is just memory that requires additional action before
it can be accessed. Yes, at the moment TDX and SEV are the only users.
It is an implementation detail that TDX and SEV use memory encryption.

> > <asm/mem_encrypt.h> is very AMD SME/SEV centric.
> 
> So?
> 
> > I'm not sure it needs to exist in the way it is now.
> 
> I'm not sure what your argument actually is for having yet another
> separate header vs putting it in a header which already deals with that
> stuff.

Because I don't think it is a good fit. Frankly, even <asm/coco.h> fits
better, although I'm not a fan either.

Do we have a file shortage? I would rather keep it separate.

> > Okay, I will move it into a separate function, but it has to be called
> > from allocate_e820() because it allocates and frees the map.
> 
> You mean, you want for allocate_e820() to call this new function because
> both allocate and free?
> 
> Might have to explain what you mean here exactly.

Both allocate_e820() and the unaccepted memory handling require access to
the EFI memory map. We only need the size of the memory map for e820, while
unaccepted memory requires walking the map. We can serve both by
requesting the map from the firmware once. That requires allocating and
freeing memory for the map.

Makes sense?
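
To illustrate the "request the map once, serve both users" idea, here is
a userspace sketch. struct mem_desc, struct map_summary, summarize_map()
and sample_map are hypothetical stand-ins for the real EFI types and stub
code; the type values 7 and 15 are what I understand the UEFI memory type
numbers for conventional and unaccepted memory to be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SIZE           4096u
#define EFI_CONVENTIONAL_MEMORY 7
#define EFI_UNACCEPTED_MEMORY   15

/* Simplified stand-in for efi_memory_desc_t. */
struct mem_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
};

/* Everything both consumers need, gathered in a single walk. */
struct map_summary {
	size_t   nr_desc;            /* sizes the e820 extension */
	bool     unaccepted_present; /* decides whether a bitmap is needed */
	uint64_t max_addr;           /* sizes the bitmap */
};

static struct map_summary summarize_map(const struct mem_desc *map,
					size_t nr_desc)
{
	struct map_summary s = { .nr_desc = nr_desc };

	assert(map != NULL || nr_desc == 0);

	for (size_t i = 0; i < nr_desc; i++) {
		uint64_t end = map[i].phys_addr +
			       map[i].num_pages * EFI_PAGE_SIZE;

		if (map[i].type == EFI_UNACCEPTED_MEMORY)
			s.unaccepted_present = true;
		if (end > s.max_addr)
			s.max_addr = end;
	}
	return s;
}

/* Made-up example map: 1MiB RAM, 1GiB unaccepted at 4GiB, 1MiB RAM. */
static const struct mem_desc sample_map[] = {
	{ EFI_CONVENTIONAL_MEMORY, 0x0,                    0x100   },
	{ EFI_UNACCEPTED_MEMORY,   UINT64_C(0x100000000), 0x40000 },
	{ EFI_CONVENTIONAL_MEMORY, 0x100000,               0x100   },
};
```

The descriptor count and the unaccepted-memory facts come out of the same
pass, so the map only has to be allocated, fetched and freed once.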

-- 
 Kirill A. Shutemov


* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-18 20:24         ` Kirill A. Shutemov
@ 2022-04-18 21:01           ` Borislav Petkov
  2022-04-18 23:50             ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Borislav Petkov @ 2022-04-18 21:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Mon, Apr 18, 2022 at 11:24:31PM +0300, Kirill A. Shutemov wrote:
> <linux/bitmap.h> doesn't include <linux/kernel.h> or similar things.
> Is it okay for now?

No, it is not ok because those linux/ includes are moving targets. They
keep changing and then that indirectly influences the decompressor.

How much functionality from linux/bitmap.h do you actually need?

> But the goal of the function is not to accept the memory, but mark it
> as unaccepted in the bitmap.

Really?

+	 * Accept small regions that might not be able to be represented
+	 * in the bitmap:
+	 */
+	if (end - start < 2 * PMD_SIZE) {
+		__accept_memory(start, end);

That looks like it is accepting to me.

> Conceptually, it is just memory that requires additional action before
> it can be accessed. Yes, at the moment TDX and SEV are the only users.
> It is an implementation detail that TDX and SEV use memory encryption.

So there *might* be some potential future use. Nothing concrete at the
moment.

> Because I don't think it is a good fit. Frankly, even <asm/coco.h> fits
> better, although I'm not a fan either.
> 
> Do we have a file shortage? I would rather keep it separate.

So I have not read a single argument for why the unaccepted memory gunk
should be separate.

We have perfectly fine mem_encrypt.[ch] files everywhere which already
contain code which deals with the kernel running as encrypted guest. The
unaccepted memory stuff is part of that - not something separate.

If it gets to get used for something different, sure, then it can be
carved out because it might need to be built separately, without the
rest of the encryption code. But as it is now, it doesn't have to. So
please put it in those files.

> Both allocate_e820() and the unaccepted memory handling require access to
> the EFI memory map. We only need the size of the memory map for e820, while
> unaccepted memory requires walking the map. We can serve both by
> requesting the map from the firmware once. That requires allocating and
> freeing memory for the map.
> 
> Makes sense?

Ok, thanks.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-18 21:01           ` Borislav Petkov
@ 2022-04-18 23:50             ` Kirill A. Shutemov
  2022-04-19  7:39               ` Borislav Petkov
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-18 23:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Mon, Apr 18, 2022 at 11:01:12PM +0200, Borislav Petkov wrote:
> On Mon, Apr 18, 2022 at 11:24:31PM +0300, Kirill A. Shutemov wrote:
> > <linux/bitmap.h> doesn't include <linux/kernel.h> or similar things.
> > Is it okay for now?
> 
> No, it is not ok because those linux/ includes are moving targets. They
> keep changing and then that indirectly influences the decompressor.
> 
> How much functionality from linux/bitmap.h do you actually need?

Below is the bare minimum required to compile bitmap.c in the decompressor.
I only made it work on my config/compiler and did not care about all
#ifdef branches.

I find it strange that you go after <linux/bitmap.h> which has limited
exposure while <linux/acpi.h> and <linux/efi.h> are there already.
Starting small will backfire once we find out that the monstrous headers
depend on what we try to replace. The big fish have to be addressed first.

What do you want me to do here?

// <linux/bitmap.h>
//
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

// <uapi/linux/swab.h>

/**
 * __swab64 - return a byteswapped 64-bit value
 * @x: value to byteswap
 */
#ifdef __HAVE_BUILTIN_BSWAP64__
#define __swab64(x) (__u64)__builtin_bswap64((__u64)(x))
#else
#define __swab64(x)				\
	(__builtin_constant_p((__u64)(x)) ?	\
	___constant_swab64(x) :			\
	__fswab64(x))
#endif

static __always_inline unsigned long __swab(const unsigned long y)
{
#if __BITS_PER_LONG == 64
	return __swab64(y);
#else /* __BITS_PER_LONG == 32 */
	return __swab32(y);
#endif
}

// <linux/swab.h>

# define swab __swab

// <linux/bits.h>

#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)

// <asm/bitops.h>
//
/**
 * __ffs - find first set bit in word
 * @word: The word to search
 *
 * Undefined if no bit exists, so code should check against 0 first.
 */
static __always_inline unsigned long __ffs(unsigned long word)
{
	asm("rep; bsf %1,%0"
		: "=r" (word)
		: "rm" (word));
	return word;
}


> > But the goal of the function is not to accept the memory, but mark it
> > as unaccepted in the bitmap.
> 
> Really?
> 
> +	 * Accept small regions that might not be able to be represented
> +	 * in the bitmap:
> +	 */
> +	if (end - start < 2 * PMD_SIZE) {
> +		__accept_memory(start, end);
> 
> That looks like it is accepting to me.

Yes, really.

As 1 bit represents 2M, not all chunks can be represented in the bitmap
and they have to be accepted. But the *goal* is to record unaccepted
memory into bitmap. Some accepting is a side effect.
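
The split described above can be sketched in userspace as follows. This
is an illustration only: struct range_split and split_range() are
hypothetical names, and the code merely computes which parts of a range
would be accepted up front and which bits would be set, following the
logic of the quoted mark_unaccepted().

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (UINT64_C(1) << 21)   /* 2MiB: one bitmap bit */
#define PMD_MASK (~(PMD_SIZE - 1))

struct range_split {
	uint64_t head_end;   /* [start, head_end) is accepted right away */
	uint64_t tail_start; /* [tail_start, end) is accepted right away */
	uint64_t first_bit;  /* first bitmap bit of the deferred middle */
	uint64_t nbits;      /* number of 2MiB chunks left unaccepted */
};

static struct range_split split_range(uint64_t start, uint64_t end)
{
	struct range_split s;

	assert(start <= end);

	/*
	 * Below 2 * PMD_SIZE there is no guarantee that a whole aligned
	 * 2MiB chunk fits, so the entire range is accepted up front.
	 */
	if (end - start < 2 * PMD_SIZE) {
		s.head_end = end;
		s.tail_start = end;
		s.first_bit = 0;
		s.nbits = 0;
		return s;
	}

	s.head_end = (start + PMD_SIZE - 1) & PMD_MASK; /* round up */
	s.tail_start = end & PMD_MASK;                  /* round down */
	s.first_bit = s.head_end / PMD_SIZE;
	s.nbits = (s.tail_start - s.head_end) / PMD_SIZE;
	return s;
}
```

For a range from 1MiB to 7MiB this accepts [1MiB, 2MiB) and [6MiB, 7MiB)
immediately and leaves two bits (bit 1 and bit 2) set for the middle, so
the accepting really is a side effect of recording the aligned part.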

The early_accept_memory() name is just wrong.

> > Conceptually, it is just memory that requires additional action before
> > it can be accessed. Yes, at the moment TDX and SEV are the only users.
> > It is an implementation detail that TDX and SEV use memory encryption.
> 
> So there *might* be some potential future use. Nothing concrete at the
> moment.
> 
> > Because I don't think it is a good fit. Frankly, even <asm/coco.h> fits
> > better, although I'm not a fan either.
> > 
> > Do we have a file shortage? I would rather keep it separate.
> 
> So I have not read a single argument for why the unaccepted memory gunk
> should be separate.

> We have perfectly fine mem_encrypt.[ch] files everywhere which already
> contain code which deals with the kernel running as encrypted guest.

And some stuff for encrypted host (SME).

> The unaccepted memory stuff is part of that - not something separate. If
> it gets to get used for something different, sure, then it can be carved
> out because it might need to be built separately, without the rest of
> the encryption code. But as it is now, it doesn't have to. So please put
> it in those files.

Okay, I will do as you want, but I really hate it.

With one hand you try to unwind the header mess in the decompressor code,
and with the other you propose to create a kitchen-sink header because the
topics are somewhat related. That looks contradictory.

-- 
 Kirill A. Shutemov


* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-18 23:50             ` Kirill A. Shutemov
@ 2022-04-19  7:39               ` Borislav Petkov
  2022-04-19 15:30                 ` Kirill A. Shutemov
                                   ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Borislav Petkov @ 2022-04-19  7:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Tue, Apr 19, 2022 at 02:50:15AM +0300, Kirill A. Shutemov wrote:
> I find it strange that you go after <linux/bitmap.h> which has limited
> exposure while <linux/acpi.h> and <linux/efi.h> are there already.

Funny you should mention that:

https://lore.kernel.org/r/YlCKWhMJEMUgJmjF@zn.tnic

I *have* been working towards that but it's a losing whack-a-mole game
when you and others keep adding new stuff.

So no, we won't take a pile of changes and let the maintainer clean it
up afterwards.

> What do you want me to do here?

I think the stuff coming from the linux/ namespace you can simply copy
into a header in compressed/, like I've done with efi.h.
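
A minimal sketch of what such a self-contained copy might look like,
built for userspace so it can be tested. The macro bodies are the ones
quoted earlier in the thread; boot_bitmap_set() and boot_test_bit() are
hypothetical names, and BITS_PER_LONG is derived from <limits.h> here
only because this sketch is not built inside the decompressor, which
would take it from its own headers.

```c
#include <assert.h>
#include <limits.h>

/* Local copies, so nothing from the linux/ namespace is included. */
#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))
#define BIT_WORD(nr)  ((nr) / BITS_PER_LONG)
#define BITMAP_FIRST_WORD_MASK(start) \
	(~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits) \
	(~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

/* Set 'len' bits starting at bit 'start'; same logic as __bitmap_set(). */
static void boot_bitmap_set(unsigned long *map, unsigned int start, int len)
{
	unsigned long *p = map + BIT_WORD(start);
	const unsigned int size = start + len;
	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);

	assert(len >= 0);

	/* Fill every word that the range covers up to its last bit. */
	while (len - bits_to_set >= 0) {
		*p |= mask_to_set;
		len -= bits_to_set;
		bits_to_set = BITS_PER_LONG;
		mask_to_set = ~0UL;
		p++;
	}
	/* Partially fill the final word, if any bits remain. */
	if (len) {
		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
		*p |= mask_to_set;
	}
}

/* Illustrative helper: return 1 if bit 'nr' is set, 0 otherwise. */
static int boot_test_bit(const unsigned long *map, unsigned int nr)
{
	return (map[BIT_WORD(nr)] >> (nr % BITS_PER_LONG)) & 1UL;
}
```

The cost of this approach is a second copy of the logic that has to be
kept in sync by hand; the benefit is that later changes to the linux/
headers cannot reach the decompressor through it.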

> // <asm/bitops.h>

The asm/ stuff can be put into a shared/ namespace header like the io
stuff you did.

> As 1 bit represents 2M, not all chunks can be represented in the bitmap
> and they have to be accepted. But the *goal* is to record unaccepted
> memory into bitmap. Some accepting is a side effect.
> 
> The early_accept_memory() name is just wrong.

Ok, how about process_unaccepted_memory(). It should be generic enough.

> Okay, I will do as you want, but I really hate it.

I find it really weird that you feel so strongly about it. If I would
have been asked to do it, I would've done it without even considering
it. But ok, since you feel so strongly about it, I've asked what the
other maintainers think.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-19  7:39               ` Borislav Petkov
@ 2022-04-19 15:30                 ` Kirill A. Shutemov
  2022-04-19 16:38                   ` Dave Hansen
  2022-04-19 19:23                   ` Borislav Petkov
  2022-04-21 12:26                 ` Borislav Petkov
  2022-04-22  0:21                 ` Kirill A. Shutemov
  2 siblings, 2 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-19 15:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Tue, Apr 19, 2022 at 09:39:53AM +0200, Borislav Petkov wrote:
> On Tue, Apr 19, 2022 at 02:50:15AM +0300, Kirill A. Shutemov wrote:
> > I find it strange that you go after <linux/bitmap.h> which has limited
> > exposure while <linux/acpi.h> and <linux/efi.h> are there already.
> 
> Funny you should mention that:
> 
> https://lore.kernel.org/r/YlCKWhMJEMUgJmjF@zn.tnic
> 
> I *have* been working towards that but it's a losing whack-a-mole game
> when you and others keep adding new stuff.
> 
> So no, we won't take a pile of changes and let the maintainer clean it
> up afterwards.
> 
> > What do you want me to do here?
> 
> I think the stuff coming from the linux/ namespace you can simply copy
> into a header in compressed/, like I've done with efi.h.

Hm. Dave was worried about having copies of _find_next_bit() and
__bitmap_*() inside compressed/.

How do we reconcile code duplication with making the decompressor self-contained?
Do we care about multiple copies of the same code in the kernel?
Do we care about keeping them in sync?

> > // <asm/bitops.h>
> 
> The asm/ stuff can be put into a shared/ namespace header like the io
> stuff you did.
> 
> > As 1 bit represents 2M, not all chunks can be represented in the bitmap
> > and they have to be accepted. But the *goal* is to record unaccepted
> > memory into bitmap. Some accepting is a side effect.
> > 
> > The early_accept_memory() name is just wrong.
> 
> Ok, how about process_unaccepted_memory(). It should be generic enough.

Sounds good.
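
A minimal user-space sketch of what a function with that name might do,
going only by the description quoted above: 2M-aligned chunks get recorded
in the bitmap, and whatever cannot be represented is accepted right away.
Everything except the name process_unaccepted_memory() is made up for
illustration (UNIT_SIZE, unaccepted_bitmap, accept_range()), and the real
accept call is platform-specific.

```c
#include <stdint.h>

#define UNIT_SIZE (2ULL << 20)	/* one bitmap bit covers 2M */

/* Hypothetical bitmap: 8 * 64 bits, enough for 1G of address space */
static uint64_t unaccepted_bitmap[8];
static uint64_t accepted_bytes;

/* Stand-in for the platform-specific accept protocol (assumption) */
static void accept_range(uint64_t start, uint64_t end)
{
	accepted_bytes += end - start;
}

/* Caller must keep [start, end) below 1G, there is no bounds check here */
static void process_unaccepted_memory(uint64_t start, uint64_t end)
{
	uint64_t head_end = (start + UNIT_SIZE - 1) & ~(UNIT_SIZE - 1);
	uint64_t tail_start = end & ~(UNIT_SIZE - 1);
	uint64_t bit;

	/* No fully covered 2M chunk: nothing to record, accept it all now */
	if (head_end >= tail_start) {
		accept_range(start, end);
		return;
	}

	/* Unaligned head and tail cannot be represented, accept them now */
	if (start < head_end)
		accept_range(start, head_end);
	if (tail_start < end)
		accept_range(tail_start, end);

	/* Record the aligned middle as unaccepted, one bit per 2M */
	for (bit = head_end / UNIT_SIZE; bit < tail_start / UNIT_SIZE; bit++)
		unaccepted_bitmap[bit / 64] |= 1ULL << (bit % 64);
}
```

With a range of 1M..7M this accepts the 1M head and the 1M tail
immediately and sets bits 1 and 2 for the 2M..6M middle, which is the
"some accepting is a side effect" part.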

-- 
 Kirill A. Shutemov

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-19 15:30                 ` Kirill A. Shutemov
@ 2022-04-19 16:38                   ` Dave Hansen
  2022-04-19 19:23                   ` Borislav Petkov
  1 sibling, 0 replies; 67+ messages in thread
From: Dave Hansen @ 2022-04-19 16:38 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	Mike Rapoport, David Hildenbrand, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]

On 4/19/22 08:30, Kirill A. Shutemov wrote:
>> I think the stuff coming from the linux/ namespace you can simply copy
>> into a header in compressed/, like I've done with efi.h.
> Hm. Dave was worried about having copies of _find_next_bit() and
> __bitmap_*() inside compressed/.
> 
> How do we reconcile code duplication with making the decompressor self-contained?
> Do we care about multiple copies of the same code in the kernel?
> Do we care about keeping them in sync?

Would it be feasible to have the common code defined as a 'static
inline' in a header that both the main kernel and the decompressor could
include?  Something like the attached patch.

I'd much rather duplicate something like this:

int strncasecmp(const char *s1, const char *s2, size_t len)
{
        return __lib_strncasecmp(s1, s2, len);
}

in the decompressor versus a real full implementation.

[-- Attachment #2: lib-header.patch --]
[-- Type: text/x-patch, Size: 1770 bytes --]

commit d94fa6842b25809958903fdd33d498bb622695a5
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Tue Apr 19 09:31:07 2022 -0700

    foo
    
    bar1

diff --git a/lib/string-internal.h b/lib/string-internal.h
new file mode 100644
index 000000000000..230c22864b75
--- /dev/null
+++ b/lib/string-internal.h
@@ -0,0 +1,34 @@
+#ifndef _LIB_STRING_INTERNAL_H
+#define _LIB_STRING_INTERNAL_H
+
+#include <linux/ctype.h>
+
+/**
+ * strncasecmp - Case insensitive, length-limited string comparison
+ * @s1: One string
+ * @s2: The other string
+ * @len: the maximum number of characters to compare
+ */
+static inline int __lib_strncasecmp(const char *s1, const char *s2, size_t len)
+{
+	/* Yes, Virginia, it had better be unsigned */
+	unsigned char c1, c2;
+
+	if (!len)
+		return 0;
+
+	do {
+		c1 = *s1++;
+		c2 = *s2++;
+		if (!c1 || !c2)
+			break;
+		if (c1 == c2)
+			continue;
+		c1 = tolower(c1);
+		c2 = tolower(c2);
+		if (c1 != c2)
+			break;
+	} while (--len);
+	return (int)c1 - (int)c2;
+}
+#endif
diff --git a/lib/string.c b/lib/string.c
index 485777c9da83..705b799e3b5c 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -29,6 +29,7 @@
 #include <asm/word-at-a-time.h>
 #include <asm/page.h>
 
+#include "string-internal.h"
 #ifndef __HAVE_ARCH_STRNCASECMP
 /**
  * strncasecmp - Case insensitive, length-limited string comparison
@@ -38,25 +39,7 @@
  */
 int strncasecmp(const char *s1, const char *s2, size_t len)
 {
-	/* Yes, Virginia, it had better be unsigned */
-	unsigned char c1, c2;
-
-	if (!len)
-		return 0;
-
-	do {
-		c1 = *s1++;
-		c2 = *s2++;
-		if (!c1 || !c2)
-			break;
-		if (c1 == c2)
-			continue;
-		c1 = tolower(c1);
-		c2 = tolower(c2);
-		if (c1 != c2)
-			break;
-	} while (--len);
-	return (int)c1 - (int)c2;
+	return __lib_strncasecmp(s1, s2, len);
 }
 EXPORT_SYMBOL(strncasecmp);
 #endif

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-19 15:30                 ` Kirill A. Shutemov
  2022-04-19 16:38                   ` Dave Hansen
@ 2022-04-19 19:23                   ` Borislav Petkov
  1 sibling, 0 replies; 67+ messages in thread
From: Borislav Petkov @ 2022-04-19 19:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Tue, Apr 19, 2022 at 06:30:02PM +0300, Kirill A. Shutemov wrote:
> Hm. Dave was worried about having copies of _find_next_bit() and
> __bitmap_*() inside compressed/.

That's fine.

> How do we reconcile code duplication with making the decompressor self-contained?

Also fine - as long as the decompressor and kernel-proper are
independent.

> Do we care about multiple copies of the same code in the kernel?

The copied versions in the decompressor should simply be sufficient for
its use. And there shouldn't be that much duplication.

Note that we're using the same strategy with the perf tool - it does copy
kernel facilities when it needs them.

> Do we care about keeping them in sync?

Nope - as long as they're sufficient for the decompressor. My
expectation here is that the decompressor won't need too many
facilities.
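
To illustrate what "sufficient for its use" could mean in practice, here
is a rough sketch of a trimmed-down, self-contained set of bitmap helpers.
This is not the kernel's implementation: the boot_ prefix and all names
below are hypothetical, and the search is a plain bit-by-bit loop rather
than the word-at-a-time version in lib/.

```c
#include <limits.h>
#include <stddef.h>

#define BOOT_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set bit 'nr' in 'map' */
static inline void boot_set_bit(size_t nr, unsigned long *map)
{
	map[nr / BOOT_BITS_PER_LONG] |= 1UL << (nr % BOOT_BITS_PER_LONG);
}

/* Clear bit 'nr' in 'map' */
static inline void boot_clear_bit(size_t nr, unsigned long *map)
{
	map[nr / BOOT_BITS_PER_LONG] &= ~(1UL << (nr % BOOT_BITS_PER_LONG));
}

/* Return non-zero if bit 'nr' is set in 'map' */
static inline int boot_test_bit(size_t nr, const unsigned long *map)
{
	return (map[nr / BOOT_BITS_PER_LONG] >> (nr % BOOT_BITS_PER_LONG)) & 1UL;
}

/*
 * Return the index of the first set bit at or after 'start',
 * or 'size' if there is none.
 */
static inline size_t boot_find_next_bit(const unsigned long *map,
					size_t size, size_t start)
{
	for (; start < size; start++) {
		if (boot_test_bit(start, map))
			return start;
	}
	return size;
}
```

A copy like this trades speed for having no dependencies beyond two
standard headers, so nothing added to the linux/ headers later can leak
into it.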

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-19  7:39               ` Borislav Petkov
  2022-04-19 15:30                 ` Kirill A. Shutemov
@ 2022-04-21 12:26                 ` Borislav Petkov
  2022-04-22  0:21                 ` Kirill A. Shutemov
  2 siblings, 0 replies; 67+ messages in thread
From: Borislav Petkov @ 2022-04-21 12:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Tue, Apr 19, 2022 at 09:39:53AM +0200, Borislav Petkov wrote:
> I find it really weird that you feel so strongly about it. If I would
> have been asked to do it, I would've done it without even considering
> it. But ok, since you feel so strongly about it, I've asked what the
> other maintainers think.

Ok, Dave thinks separate files are better so let's leave it at that. We
can always change things later.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory
  2022-04-09 23:44   ` Kirill A. Shutemov
@ 2022-04-21 12:29     ` Borislav Petkov
  0 siblings, 0 replies; 67+ messages in thread
From: Borislav Petkov @ 2022-04-21 12:29 UTC (permalink / raw)
  To: Kirill A. Shutemov, Tom Lendacky, Michael Roth
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Brijesh Singh,
	Mike Rapoport, David Hildenbrand, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Sun, Apr 10, 2022 at 02:44:58AM +0300, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 10:02:13AM -0700, Dave Hansen wrote:
> > On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > > Patches 1-6/7 are generic and don't have any dependencies on TDX. They
> > > should serve AMD SEV needs as well. TDX-specific code isolated in the
> > > last patch.
> > 
> > Oh, that's quite nice.  Are the SEV-SNP folks planning on using this?
> > If they are, acks/reviews would be much appreciated.
> 
> AMD folks tested one of the previous revisions and reported that it works, but
> I don't remember seeing the code that hooks up the AMD implementation.

Yes, they will be using this eventually.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-19  7:39               ` Borislav Petkov
  2022-04-19 15:30                 ` Kirill A. Shutemov
  2022-04-21 12:26                 ` Borislav Petkov
@ 2022-04-22  0:21                 ` Kirill A. Shutemov
  2022-04-22  9:30                   ` Borislav Petkov
  2 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-22  0:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Tue, Apr 19, 2022 at 09:39:53AM +0200, Borislav Petkov wrote:
> On Tue, Apr 19, 2022 at 02:50:15AM +0300, Kirill A. Shutemov wrote:
> > I find it strange that you go after <linux/bitmap.h> which has limited
> > exposure while <linux/acpi.h> and <linux/efi.h> are there already.
> 
> Funny you should mention that:
> 
> https://lore.kernel.org/r/YlCKWhMJEMUgJmjF@zn.tnic

There's still #include <linux/efi.h> in misc.h. You removed one, but
there's a second one for some reason.

Any plans for <linux/acpi.h>? It includes <linux/bitmap.h>:

In file included from ./include/linux/cpumask.h:12,
                 from ./include/linux/smp.h:13,
                 from ./include/linux/lockdep.h:14,
                 from ./include/linux/mutex.h:17,
                 from ./include/linux/kernfs.h:11,
                 from ./include/linux/sysfs.h:16,
                 from ./include/linux/kobject.h:20,
                 from ./include/linux/of.h:17,
                 from ./include/linux/irqdomain.h:35,
                 from ./include/linux/acpi.h:13,
                 from arch/x86/boot/compressed/misc.h:3

We will get name conflicts if we try to copy <linux/bitmap.h> stuff.
Hm.

I also underestimated what needs to be copied because of the
indirect includes. The list was only enough to compile bitmap.c. mem.c
(formerly unaccepted_memory.c) would require more.

BTW, do we have a whitelist of linux/ includes that are allowed? minmax.h?
math.h? Where is the line?

Maybe allow what is included directly or indirectly now? (Yes, it is my
poor attempt to slide under the closing door.)

-- 
 Kirill A. Shutemov

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-22  0:21                 ` Kirill A. Shutemov
@ 2022-04-22  9:30                   ` Borislav Petkov
  2022-04-22 13:26                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: Borislav Petkov @ 2022-04-22  9:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Fri, Apr 22, 2022 at 03:21:24AM +0300, Kirill A. Shutemov wrote:
> There's still #include <linux/efi.h> in misc.h. You removed one, but
> there's a second one for some reason.

I don't know which tree you're looking at but latest tip/master has:

$ git grep -E "efi\.h" arch/x86/boot/
arch/x86/boot/compressed/acpi.c:6:#include "efi.h"
arch/x86/boot/compressed/kaslr.c:25:#include "efi.h"
arch/x86/boot/compressed/misc.h:40:#include "efi.h"
arch/x86/boot/compressed/pgtable_64.c:7:#include "efi.h"

> Any plans for <linux/acpi.h>? It includes <linux/bitmap.h>:

So if misc.h is including linux/bitmap.h indirectly, you can simply
include misc.h right?

And then you'll slide under the closing door, as you say below. :)

> I also underestimated what needs to be copied because of the
> indirect includes. The list was only enough to compile bitmap.c. mem.c
> (formerly unaccepted_memory.c) would require more.

More like?

Maybe I can help out converting some of the stuff. You could push your
current state somewhere - even if it doesn't build - so that I can take
a look...

> BTW, do we have a whitelist of linux/ includes that are allowed? minmax.h?
> math.h? Where is the line?

Well, that's the thing. Even if those look innocuous now, if they get
new includes added to them, that has an influence on the decompressor.

So I'm thinking copying the required bits would be the proper way
forward.

> Maybe allow what is included directly or indirectly now? (Yes, it is my
> poor attempt to slide under the closing door.)

That's basically saying, can I get this done so that I can mark my
checkbox that my task is done - you can deal with the crap later
yourself.

How about we take our time and solve this properly instead of hurrying
constantly?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory
  2022-04-22  9:30                   ` Borislav Petkov
@ 2022-04-22 13:26                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2022-04-22 13:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, Dave Hansen,
	Brijesh Singh, Mike Rapoport, David Hildenbrand, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Fri, Apr 22, 2022 at 11:30:11AM +0200, Borislav Petkov wrote:
> On Fri, Apr 22, 2022 at 03:21:24AM +0300, Kirill A. Shutemov wrote:
> > There's still #include <linux/efi.h> in misc.h. You removed one, but
> > there's a second one for some reason.
> 
> I don't know which tree you're looking at but latest tip/master has:
> 
> $ git grep -E "efi\.h" arch/x86/boot/
> arch/x86/boot/compressed/acpi.c:6:#include "efi.h"
> arch/x86/boot/compressed/kaslr.c:25:#include "efi.h"
> arch/x86/boot/compressed/misc.h:40:#include "efi.h"
> arch/x86/boot/compressed/pgtable_64.c:7:#include "efi.h"

Sorry for the noise. I read 'elf.h' as 'efi.h' :/

But it also includes <linux/bitmap.h> indirectly:

In file included from include/linux/elf.h:6:
In file included from arch/x86/include/asm/elf.h:8:
In file included from include/linux/thread_info.h:60:
In file included from arch/x86/include/asm/thread_info.h:53:
In file included from arch/x86/include/asm/cpufeature.h:5:
In file included from arch/x86/include/asm/processor.h:22:
In file included from arch/x86/include/asm/msr.h:11:
In file included from arch/x86/include/asm/cpumask.h:5:
In file included from include/linux/cpumask.h:12:

> > Any plans for <linux/acpi.h>? It includes <linux/bitmap.h>:
> 
> So if misc.h is including linux/bitmap.h indirectly, you can simply
> include misc.h right?

Yes.

> And then you'll slide under the closing door, as you say below. :)

Is it sarcasm or clearance to go?

> > I also underestimated what needs to be copied because of the
> > indirect includes. The list was only enough to compile bitmap.c. mem.c
> > (formerly unaccepted_memory.c) would require more.
> 
> More like?

for_each_clear_bitrange() is a pain to unwind.
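
For a sense of what an open-coded replacement could look like, here is a
self-contained sketch that walks runs of clear bits without relying on
the kernel macro or its find_*_bit() dependencies. The helper names
(bit_is_set, next_clear_range, count_clear_bits) are invented for this
example and are not what the series uses.

```c
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

static int bit_is_set(const unsigned long *map, size_t nr)
{
	return (map[nr / WORD_BITS] >> (nr % WORD_BITS)) & 1UL;
}

/*
 * Find the next run of clear bits at or after 'from'.
 * On success store it as [*start, *end) and return 1.
 * Return 0 when no clear bit is left below 'size'.
 */
static int next_clear_range(const unsigned long *map, size_t size,
			    size_t from, size_t *start, size_t *end)
{
	size_t s = from, e;

	/* Skip over set bits to the start of the run */
	while (s < size && bit_is_set(map, s))
		s++;
	if (s >= size)
		return 0;

	/* Extend the run until the next set bit or the end of the map */
	e = s;
	while (e < size && !bit_is_set(map, e))
		e++;

	*start = s;
	*end = e;
	return 1;
}

/* Example caller: total number of clear bits, one range at a time */
static size_t count_clear_bits(const unsigned long *map, size_t size)
{
	size_t start, end, pos, total = 0;

	for (pos = 0; next_clear_range(map, size, pos, &start, &end); pos = end)
		total += end - start;
	return total;
}
```

The loop in count_clear_bits() has the same shape a caller of the macro
would have, with the range bounds passed explicitly instead of being
hidden in the macro expansion.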

> Maybe I can help out converting some of the stuff. You could push your
> current state somewhere - even if it doesn't build - so that I can take
> a look...

I will push what I have a bit later today.

> > BTW, do we have a whitelist of linux/ includes that are allowed? minmax.h?
> > math.h? Where is the line?
> 
> Well, that's the thing. Even if those look innocuous now, if they get
> new includes added to them, that has an influence on the decompressor.
> 
> So I'm thinking copying the required bits would be the proper way
> forward.

I understand where you're coming from. But on my side I suddenly face a
higher entry bar. Yes, it is a bad excuse, I know.

> > Maybe allow what is included directly or indirectly now? (Yes, it is my
> > poor attempt to slide under the closing door.)
> 
> That's basically saying, can I get this done so that I can mark my
> checkbox that my task is done - you can deal with the crap later
> yourself.
> 
> How about we take our time and solve this properly instead of hurrying
> constantly?

I'm okay with this. But I lack a coherent understanding of how you want it
to look.

Like, looking at your new "efi.h", I see it still implicitly depends on
<linux/types.h> and <linux/uuid.h>. Why is it okay? Is it temporary? What
are the criteria for what is okay to keep for now?

You mentioned having <asm/shared/bitops.h> as we do <asm/shared/io.h>. But
<asm/bitops.h> has non-trivial dependencies of its own.

Okay, we can move them into asm/shared as well, but how do we deal with
asm-generic/ things? And linux/ dependencies? Do we create a copy in
x86/include?

-- 
 Kirill A. Shutemov

Thread overview: 67+ messages
2022-04-05 23:43 [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
2022-04-05 23:43 ` [PATCHv4 1/8] mm: Add " Kirill A. Shutemov
2022-04-08 18:55   ` Dave Hansen
2022-04-09 15:54     ` Kirill A. Shutemov
2022-04-11  6:38       ` Dave Hansen
2022-04-11 10:07         ` Mike Rapoport
2022-04-13 11:40           ` Kirill A. Shutemov
2022-04-13 14:48             ` Mike Rapoport
2022-04-13 15:15               ` Kirill A. Shutemov
2022-04-13 20:06                 ` Mike Rapoport
2022-04-11  8:47       ` David Hildenbrand
2022-04-08 19:04   ` David Hildenbrand
2022-04-08 19:11   ` Dave Hansen
2022-04-09 17:52     ` Kirill A. Shutemov
2022-04-11  6:41       ` Dave Hansen
2022-04-11 15:55         ` Borislav Petkov
2022-04-11 16:27           ` Dave Hansen
2022-04-11 18:55             ` Tom Lendacky
2022-04-12  8:15     ` David Hildenbrand
2022-04-12 16:08       ` Dave Hansen
2022-04-13 10:36         ` David Hildenbrand
2022-04-13 11:30           ` Kirill A. Shutemov
2022-04-13 11:32             ` David Hildenbrand
2022-04-13 15:36             ` Dave Hansen
2022-04-13 16:07               ` David Hildenbrand
2022-04-13 16:13                 ` Dave Hansen
2022-04-13 16:24               ` Kirill A. Shutemov
2022-04-13 14:39           ` Mike Rapoport
2022-04-05 23:43 ` [PATCHv4 2/8] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
2022-04-13  9:59   ` Borislav Petkov
2022-04-13 11:45     ` Kirill A. Shutemov
2022-04-05 23:43 ` [PATCHv4 3/8] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
2022-04-08 17:26   ` Dave Hansen
2022-04-09 19:41     ` Kirill A. Shutemov
2022-04-14 15:55     ` Borislav Petkov
2022-04-15 22:24   ` Borislav Petkov
2022-04-18 15:55     ` Kirill A. Shutemov
2022-04-18 16:38       ` Borislav Petkov
2022-04-18 20:24         ` Kirill A. Shutemov
2022-04-18 21:01           ` Borislav Petkov
2022-04-18 23:50             ` Kirill A. Shutemov
2022-04-19  7:39               ` Borislav Petkov
2022-04-19 15:30                 ` Kirill A. Shutemov
2022-04-19 16:38                   ` Dave Hansen
2022-04-19 19:23                   ` Borislav Petkov
2022-04-21 12:26                 ` Borislav Petkov
2022-04-22  0:21                 ` Kirill A. Shutemov
2022-04-22  9:30                   ` Borislav Petkov
2022-04-22 13:26                     ` Kirill A. Shutemov
2022-04-05 23:43 ` [PATCHv4 4/8] x86/boot/compressed: Handle " Kirill A. Shutemov
2022-04-08 17:57   ` Dave Hansen
2022-04-09 20:20     ` Kirill A. Shutemov
2022-04-11  6:49       ` Dave Hansen
2022-04-05 23:43 ` [PATCHv4 5/8] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
2022-04-08 18:08   ` Dave Hansen
2022-04-09 20:43     ` Kirill A. Shutemov
2022-04-05 23:43 ` [PATCHv4 6/8] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
2022-04-08 18:15   ` Dave Hansen
2022-04-08 19:21   ` Dave Hansen
2022-04-13 16:08     ` Kirill A. Shutemov
2022-04-05 23:43 ` [PATCHv4 7/8] x86/tdx: Unaccepted memory support Kirill A. Shutemov
2022-04-08 18:28   ` Dave Hansen
2022-04-05 23:43 ` [PATCHv4 8/8] mm/vmstat: Add counter for memory accepting Kirill A. Shutemov
2022-04-12  8:18   ` David Hildenbrand
2022-04-08 17:02 ` [PATCHv4 0/8] mm, x86/cc: Implement support for unaccepted memory Dave Hansen
2022-04-09 23:44   ` Kirill A. Shutemov
2022-04-21 12:29     ` Borislav Petkov
