* [PATCHv2 0/7] Implement support for unaccepted memory
@ 2022-01-11 11:33 Kirill A. Shutemov
  2022-01-11 11:33 ` [PATCHv2 1/7] mm: Add " Kirill A. Shutemov
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 would have to be modified on every page acceptance.
That leads to table fragmentation, and there is only a limited number
of entries in the e820 table.

Another option is to mark such memory as usable in e820 and track
whether a range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
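As a back-of-the-envelope check of those numbers (a standalone sketch,
not code from the series; unaccepted_bitmap_bytes() is a hypothetical
helper):

```c
#include <stdint.h>

/* Bytes of bitmap needed to cover max_addr bytes of physical address
 * space at one bit per 2MiB chunk, rounding up for a partial chunk. */
uint64_t unaccepted_bitmap_bytes(uint64_t max_addr)
{
	const uint64_t coverage_per_byte = (2ULL << 20) * 8; /* 8 bits x 2MiB */

	return (max_addr + coverage_per_byte - 1) / coverage_per_byte;
}
```

One 4k page of bitmap covers 4096 * 8 * 2MiB = 64GiB, and 4PiB of
address space needs 256MiB of bitmap, matching the figures above.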

Any unaccepted memory that is not aligned to 2MiB gets accepted upfront.
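The alignment handling can be sketched as follows (a simplified model of
what mark_unaccepted() does later in the series; split_unaccepted() is a
hypothetical helper, not code from the patches):

```c
#include <stdint.h>

#define PMD (2ULL << 20)	/* bitmap granularity: 2MiB */

/* Split an unaccepted range: the misaligned head and tail are accepted
 * upfront (their lengths are reported via out-params); only the
 * 2MiB-aligned interior is recorded in the bitmap. Returns the number
 * of bitmap bits that would be set. */
uint64_t split_unaccepted(uint64_t start, uint64_t end,
			  uint64_t *head_len, uint64_t *tail_len)
{
	uint64_t astart = (start + PMD - 1) & ~(PMD - 1);	/* round_up */
	uint64_t aend = end & ~(PMD - 1);			/* round_down */

	if (astart >= aend) {
		/* Range fits within a single 2MiB block: accept it all now */
		*head_len = end - start;
		*tail_len = 0;
		return 0;
	}
	*head_len = astart - start;
	*tail_len = end - aend;
	return (aend - astart) / PMD;
}
```

For example, an unaccepted range [1MiB, 7MiB) gets a 1MiB head and a
1MiB tail accepted immediately, and two bits set for [2MiB, 6MiB).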

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for a 4G TDX VM and ~4x faster for a 64G one.

Patches 1-6/7 are generic and don't have any dependencies on TDX. They
should serve AMD SEV needs as well. The TDX-specific code is isolated in
the last patch. That patch requires the core TDX patchset, which is
currently under review.

Kirill A. Shutemov (7):
  mm: Add support for unaccepted memory
  efi/x86: Get full memory map in allocate_e820()
  efi/x86: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  x86/mm: Reserve unaccepted memory bitmap
  x86/mm: Provide helpers for unaccepted memory
  x86/tdx: Unaccepted memory support

 Documentation/x86/zero-page.rst              |  1 +
 arch/x86/Kconfig                             |  1 +
 arch/x86/boot/compressed/Makefile            |  1 +
 arch/x86/boot/compressed/bitmap.c            | 86 ++++++++++++++++++
 arch/x86/boot/compressed/kaslr.c             | 14 ++-
 arch/x86/boot/compressed/misc.c              |  9 ++
 arch/x86/boot/compressed/tdx.c               | 67 ++++++++++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 64 +++++++++++++
 arch/x86/include/asm/page.h                  |  5 ++
 arch/x86/include/asm/tdx.h                   |  2 +
 arch/x86/include/asm/unaccepted_memory.h     | 17 ++++
 arch/x86/include/uapi/asm/bootparam.h        |  3 +-
 arch/x86/kernel/e820.c                       | 10 +++
 arch/x86/kernel/tdx.c                        |  7 ++
 arch/x86/mm/Makefile                         |  2 +
 arch/x86/mm/unaccepted_memory.c              | 94 ++++++++++++++++++++
 drivers/firmware/efi/Kconfig                 | 14 +++
 drivers/firmware/efi/efi.c                   |  1 +
 drivers/firmware/efi/libstub/x86-stub.c      | 86 ++++++++++++++----
 include/linux/efi.h                          |  3 +-
 include/linux/page-flags.h                   |  4 +
 mm/internal.h                                | 15 ++++
 mm/memblock.c                                |  1 +
 mm/page_alloc.c                              | 21 ++++-
 24 files changed, 508 insertions(+), 20 deletions(-)
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/unaccepted_memory.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 arch/x86/mm/unaccepted_memory.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-01-11 11:33 ` Kirill A. Shutemov
  2022-01-11 19:46   ` Dave Hansen
  2022-01-11 11:33 ` [PATCHv2 2/7] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual Machine
platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

Support for such memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - the page allocator has to accept memory on the first allocation of
    a page.

The memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
PageOffline() is used to indicate that a page requires acceptance. The
flag is currently used by hotplug and ballooning; such pages are not
available to the page allocator.

An architecture has to provide three helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - maybe_set_page_offline() marks a page PageOffline() if it requires
   acceptance. Used during boot to put pages on free lists.

 - accept_and_clear_page_offline() makes a page accepted and clears
   PageOffline().
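The contract the helpers implement can be illustrated with a toy model
(plain userspace C, not kernel code; all names below stand in for the
real kernel objects): a page starts out "offline" (unaccepted), and the
allocator pays the acceptance cost lazily, on the first allocation only.

```c
#include <stdbool.h>

#define NPAGES 8

static bool page_offline[NPAGES];	/* stands in for PageOffline() */
static bool page_accepted[NPAGES];	/* platform acceptance state */
static int accept_calls;		/* how often the costly path ran */

/* Model of accept_and_clear_page_offline(): run the (costly) platform
 * acceptance protocol, then clear the offline marker. */
static void accept_and_clear_page_offline(int pfn)
{
	accept_calls++;
	page_accepted[pfn] = true;
	page_offline[pfn] = false;
}

/* Model of the post_alloc_hook() change: only a still-offline page
 * triggers acceptance, so repeat allocations are free. */
void post_alloc_hook(int pfn)
{
	if (page_offline[pfn])
		accept_and_clear_page_offline(pfn);
}
```

Allocating the same page twice accepts it exactly once; every later
allocation sees an already-accepted page and skips the protocol.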

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/page-flags.h |  4 ++++
 mm/internal.h              | 15 +++++++++++++++
 mm/memblock.c              |  1 +
 mm/page_alloc.c            | 21 ++++++++++++++++++++-
 4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 52ec4b5e5615..281f70da329c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -887,6 +887,10 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * any further access to page content. PFN walkers that read content of random
  * pages should check PageOffline() and synchronize with such drivers using
  * page_offline_freeze()/page_offline_thaw().
+ *
+ * If a PageOffline() page is encountered on a buddy allocator's free list,
+ * it has to be "accepted" before it can be used.
+ * See accept_and_clear_page_offline() and CONFIG_UNACCEPTED_MEMORY.
  */
 PAGE_TYPE_OPS(Offline, offline)
 
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..1738a4e2a27e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -713,4 +713,19 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline void maybe_set_page_offline(struct page *page, unsigned int order)
+{
+}
+
+static inline void accept_and_clear_page_offline(struct page *page,
+						 unsigned int order)
+{
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 1018e50566f3..6dfa594192de 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0, 0);
 
+	accept_memory(found, found + size);
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..5707b4b5f774 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned int max_order;
 	struct page *buddy;
 	bool to_tail;
+	bool offline = PageOffline(page);
 
 	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
 
@@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
 			del_page_from_free_list(buddy, zone, order);
+
+		if (PageOffline(buddy))
+			offline = true;
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_buddy_order(page, order);
 
+	if (offline)
+		__SetPageOffline(page);
+
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
 	else if (is_shuffle_order(order))
@@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
+	    !PageOffline(page))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1734,6 +1743,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
 {
 	if (early_page_uninitialised(pfn))
 		return;
+
+	maybe_set_page_offline(page, order);
 	__free_pages_core(page, order);
 }
 
@@ -1823,10 +1834,12 @@ static void __init deferred_free_range(unsigned long pfn,
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		maybe_set_page_offline(page, pageblock_order);
 		__free_pages_core(page, pageblock_order);
 		return;
 	}
 
+	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2297,6 +2310,9 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
+		if (PageOffline(page))
+			__SetPageOffline(&page[size]);
+
 		add_to_free_list(&page[size], zone, high, migratetype);
 		set_buddy_order(&page[size], high);
 	}
@@ -2393,6 +2409,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	kernel_unpoison_pages(page, 1 << order);
 
+	if (PageOffline(page))
+		accept_and_clear_page_offline(page, order);
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * kasan_alloc_pages and kernel_init_free_pages must be
-- 
2.34.1



* [PATCHv2 2/7] efi/x86: Get full memory map in allocate_e820()
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
  2022-01-11 11:33 ` [PATCHv2 1/7] mm: Add " Kirill A. Shutemov
@ 2022-01-11 11:33 ` Kirill A. Shutemov
  2022-01-11 11:33 ` [PATCHv2 3/7] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Currently allocate_e820() is only interested in the size of the memory
map and the size of each memory descriptor, which is enough to determine
how many e820 entries the kernel needs.
UEFI Specification version 2.9 introduces a new memory type --
unaccepted memory. To track unaccepted memory, the kernel needs to
allocate a bitmap. The size of the bitmap depends on the maximum
physical address present in the system, and a full memory map is
required to find that address.

Modify allocate_e820() to get a full memory map.

This is preparation for the next patch that implements handling of
unaccepted memory in EFI stub.
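The entry-count arithmetic the patch reshuffles can be sketched like
this (a standalone model; the constants approximate
E820_MAX_ENTRIES_ZEROPAGE and EFI_MMAP_NR_SLACK_SLOTS, and
nr_e820ext() is a hypothetical helper):

```c
#define E820_MAX 128	/* approx. E820_MAX_ENTRIES_ZEROPAGE */
#define SLACK 8		/* approx. EFI_MMAP_NR_SLACK_SLOTS */

/* How many e820 entries spill past boot_params::e820_table into the
 * setup_data extension, given the EFI memory map size and the size of
 * one descriptor, keeping SLACK slots in reserve. */
unsigned int nr_e820ext(unsigned long map_size, unsigned long desc_size)
{
	unsigned int nr_desc = map_size / desc_size;

	if (nr_desc <= E820_MAX - SLACK)
		return 0;	/* everything fits in the zero page */
	return nr_desc - E820_MAX + SLACK;
}
```

With, say, 48-byte descriptors, a 100-descriptor map fits in the zero
page, while a 130-descriptor map needs a 10-entry extension.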

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/firmware/efi/libstub/x86-stub.c | 28 ++++++++++++-------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index f14c4ff5839f..a0b946182b5e 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -569,30 +569,28 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
 }
 
 static efi_status_t allocate_e820(struct boot_params *params,
+				  struct efi_boot_memmap *map,
 				  struct setup_data **e820ext,
 				  u32 *e820ext_size)
 {
-	unsigned long map_size, desc_size, map_key;
 	efi_status_t status;
-	__u32 nr_desc, desc_version;
+	__u32 nr_desc;
 
-	/* Only need the size of the mem map and size of each mem descriptor */
-	map_size = 0;
-	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
-			     &desc_size, &desc_version);
-	if (status != EFI_BUFFER_TOO_SMALL)
-		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
-
-	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
+	status = efi_get_memory_map(map);
+	if (status != EFI_SUCCESS)
+		return status;
 
-	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
-		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+	nr_desc = *map->map_size / *map->desc_size;
+	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+			EFI_MMAP_NR_SLACK_SLOTS;
 
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
 		if (status != EFI_SUCCESS)
-			return status;
+			goto out;
 	}
-
+out:
+	efi_bs_call(free_pool, *map->map);
 	return EFI_SUCCESS;
 }
 
@@ -642,7 +640,7 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
 	priv.boot_params	= boot_params;
 	priv.efi		= &boot_params->efi_info;
 
-	status = allocate_e820(boot_params, &e820ext, &e820ext_size);
+	status = allocate_e820(boot_params, &map, &e820ext, &e820ext_size);
 	if (status != EFI_SUCCESS)
 		return status;
 
-- 
2.34.1



* [PATCHv2 3/7] efi/x86: Implement support for unaccepted memory
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
  2022-01-11 11:33 ` [PATCHv2 1/7] mm: Add " Kirill A. Shutemov
  2022-01-11 11:33 ` [PATCHv2 2/7] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
@ 2022-01-11 11:33 ` Kirill A. Shutemov
  2022-01-11 17:17   ` Dave Hansen
  2022-01-11 11:33 ` [PATCHv2 4/7] x86/boot/compressed: Handle " Kirill A. Shutemov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 would have to be modified on every page acceptance.
That leads to table fragmentation, and there is only a limited number
of entries in the e820 table.

Another option is to mark such memory as usable in e820 and track
whether a range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to 2MiB gets accepted upfront.

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via boot_params: allocate_e820() allocates the bitmap if
unaccepted memory is present, sized according to the maximum address in
the memory map.

The same boot_params.unaccepted_memory can be used to pass the bitmap
between two kernels on kexec, but the use-case is not yet implemented.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/zero-page.rst              |  1 +
 arch/x86/boot/compressed/Makefile            |  1 +
 arch/x86/boot/compressed/bitmap.c            | 24 ++++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 45 +++++++++++++++
 arch/x86/include/asm/unaccepted_memory.h     | 12 ++++
 arch/x86/include/uapi/asm/bootparam.h        |  3 +-
 drivers/firmware/efi/Kconfig                 | 14 +++++
 drivers/firmware/efi/efi.c                   |  1 +
 drivers/firmware/efi/libstub/x86-stub.c      | 60 +++++++++++++++++++-
 include/linux/efi.h                          |  3 +-
 10 files changed, 161 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/unaccepted_memory.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
index f088f5881666..8e3447a4b373 100644
--- a/Documentation/x86/zero-page.rst
+++ b/Documentation/x86/zero-page.rst
@@ -42,4 +42,5 @@ Offset/Size	Proto	Name			Meaning
 2D0/A00		ALL	e820_table		E820 memory map table
 						(array of struct e820_entry)
 D00/1EC		ALL	eddbuf			EDD data (array of struct edd_info)
+ECC/008		ALL	unaccepted_memory	Bitmap of unaccepted memory (1bit == 2M)
 ===========	=====	=======================	=================================================
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 1bfe30ebadbe..f5b49e74d728 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -100,6 +100,7 @@ endif
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/unaccepted_memory.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
new file mode 100644
index 000000000000..bf58b259380a
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Taken from lib/string.c */
+
+#include <linux/bitmap.h>
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_set >= 0) {
+		*p |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_LONG;
+		mask_to_set = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+		*p |= mask_to_set;
+	}
+}
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
new file mode 100644
index 000000000000..d8081cde0eed
--- /dev/null
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "error.h"
+#include "misc.h"
+
+static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-acceptance call goes here */
+	error("Cannot accept memory");
+}
+
+void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
+{
+	/*
+	 * The accepted memory bitmap only works at PMD_SIZE granularity.
+	 * If a request comes in to mark memory as unaccepted which is not
+	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
+	 * *marked* as unaccepted.
+	 */
+
+	/* Immediately accept whole range if it is within a PMD_SIZE block: */
+	if ((start & PMD_MASK) == (end & PMD_MASK)) {
+		/* 'start' and 'end' fall within the same 2MiB block */
+		__accept_memory(start, end);
+		return;
+	}
+
+	/* Immediately accept a <PMD_SIZE piece at the start: */
+	if (start & ~PMD_MASK) {
+		__accept_memory(start, round_up(start, PMD_SIZE));
+		start = round_up(start, PMD_SIZE);
+	}
+
+	/* Immediately accept a <PMD_SIZE piece at the end: */
+	if (end & ~PMD_MASK) {
+		__accept_memory(round_down(end, PMD_SIZE), end);
+		end = round_down(end, PMD_SIZE);
+	}
+
+	if (start == end)
+		return;
+
+	bitmap_set((unsigned long *)params->unaccepted_memory,
+		   start / PMD_SIZE, (end - start) / PMD_SIZE);
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..cbc24040b853
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+#include <linux/types.h>
+
+struct boot_params;
+
+void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
+
+#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index b25d3f82c2f3..16bc686a198d 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -217,7 +217,8 @@ struct boot_params {
 	struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */
 	__u8  _pad8[48];				/* 0xcd0 */
 	struct edd_info eddbuf[EDDMAXNR];		/* 0xd00 */
-	__u8  _pad9[276];				/* 0xeec */
+	__u64 unaccepted_memory;			/* 0xeec */
+	__u8  _pad9[268];				/* 0xef4 */
 } __attribute__((packed));
 
 /**
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 2c3dac5ecb36..36c1bf33f112 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -243,6 +243,20 @@ config EFI_DISABLE_PCI_DMA
 	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
 	  may be used to override this option.
 
+config UNACCEPTED_MEMORY
+	bool
+	depends on EFI_STUB
+	help
+	   Some Virtual Machine platforms, such as Intel TDX, introduce
+	   the concept of memory acceptance, requiring memory to be accepted
+	   before it can be used by the guest. This protects against a class of
+	   attacks by the virtual machine platform.
+
+	   UEFI specification v2.9 introduced the EFI_UNACCEPTED_MEMORY memory type.
+
+	   This option adds support for unaccepted memory and makes such memory
+	   usable by the kernel.
+
 endmenu
 
 config EFI_EMBEDDED_FIRMWARE
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index ae79c3300129..abe862c381b6 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -740,6 +740,7 @@ static __initdata char memory_type_name[][13] = {
 	"MMIO Port",
 	"PAL Code",
 	"Persistent",
+	"Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0b946182b5e..346b12d6f1b2 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -9,12 +9,14 @@
 #include <linux/efi.h>
 #include <linux/pci.h>
 #include <linux/stddef.h>
+#include <linux/bitmap.h>
 
 #include <asm/efi.h>
 #include <asm/e820/types.h>
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/boot.h>
+#include <asm/unaccepted_memory.h>
 
 #include "efistub.h"
 
@@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 			e820_type = E820_TYPE_PMEM;
 			break;
 
+		case EFI_UNACCEPTED_MEMORY:
+			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+				continue;
+			e820_type = E820_TYPE_RAM;
+			mark_unaccepted(params, d->phys_addr,
+					d->phys_addr + PAGE_SIZE * d->num_pages);
+			break;
 		default:
 			continue;
 		}
@@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
 {
 	efi_status_t status;
 	__u32 nr_desc;
+	bool unaccepted_memory_present = false;
+	u64 max_addr = 0;
+	int i;
 
 	status = efi_get_memory_map(map);
 	if (status != EFI_SUCCESS)
@@ -589,9 +601,55 @@ static efi_status_t allocate_e820(struct boot_params *params,
 		if (status != EFI_SUCCESS)
 			goto out;
 	}
+
+	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+		goto out;
+
+	/* Check if there's any unaccepted memory and find the max address */
+	for (i = 0; i < nr_desc; i++) {
+		efi_memory_desc_t *d;
+
+		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
+		if (d->type == EFI_UNACCEPTED_MEMORY)
+			unaccepted_memory_present = true;
+		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
+			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
+	}
+
+	/*
+	 * If unaccepted memory present allocate a bitmap to track what memory
+	 * has to be accepted before access.
+	 *
+	 * One bit in the bitmap represents 2MiB in the address space: one 4k
+	 * page is enough to track 64GiB of physical address space.
+	 *
+	 * In the worst case scenario -- a huge hole in the middle of the
+	 * address space -- it needs 256MiB to handle 4PiB of the address
+	 * space.
+	 *
+	 * TODO: handle the case when params->unaccepted_memory is already set.
+	 * It's required to deal with kexec.
+	 *
+	 * The bitmap will be populated in setup_e820() according to the memory
+	 * map after efi_exit_boot_services().
+	 */
+	if (unaccepted_memory_present) {
+		unsigned long *unaccepted_memory = NULL;
+		u64 size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+
+		status = efi_allocate_pages(size,
+					    (unsigned long *)&unaccepted_memory,
+					    ULONG_MAX);
+		if (status != EFI_SUCCESS)
+			goto out;
+		memset(unaccepted_memory, 0, size);
+		params->unaccepted_memory = (u64)unaccepted_memory;
+	}
+
 out:
 	efi_bs_call(free_pool, *map->map);
-	return EFI_SUCCESS;
+	return status;
+
 }
 
 struct exit_boot_struct {
diff --git a/include/linux/efi.h b/include/linux/efi.h
index dbd39b20e034..270333b9b94d 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
 #define EFI_PERSISTENT_MEMORY		14
-#define EFI_MAX_MEMORY_TYPE		15
+#define EFI_UNACCEPTED_MEMORY		15
+#define EFI_MAX_MEMORY_TYPE		16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
-- 
2.34.1



* [PATCHv2 4/7] x86/boot/compressed: Handle unaccepted memory
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2022-01-11 11:33 ` [PATCHv2 3/7] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-01-11 11:33 ` Kirill A. Shutemov
  2022-01-11 11:33 ` [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Firmware is responsible for accepting the memory where the compressed
kernel image and initrd land. But the kernel has to accept memory for
the decompression buffer: accept it just before decompression starts.

KASLR is allowed to use unaccepted memory for the output buffer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/bitmap.c            | 62 ++++++++++++++++++++
 arch/x86/boot/compressed/kaslr.c             | 14 ++++-
 arch/x86/boot/compressed/misc.c              |  9 +++
 arch/x86/boot/compressed/unaccepted_memory.c | 13 ++++
 arch/x86/include/asm/unaccepted_memory.h     |  2 +
 5 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
index bf58b259380a..ba2de61c0823 100644
--- a/arch/x86/boot/compressed/bitmap.c
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -2,6 +2,48 @@
 /* Taken from lib/string.c */
 
 #include <linux/bitmap.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert, unsigned long le)
+{
+	unsigned long tmp, mask;
+
+	if (unlikely(start >= nbits))
+		return nbits;
+
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
+
+	/* Handle 1st word. */
+	mask = BITMAP_FIRST_WORD_MASK(start);
+	if (le)
+		mask = swab(mask);
+
+	tmp &= mask;
+
+	start = round_down(start, BITS_PER_LONG);
+
+	while (!tmp) {
+		start += BITS_PER_LONG;
+		if (start >= nbits)
+			return nbits;
+
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
+	}
+
+	if (le)
+		tmp = swab(tmp);
+
+	return min(start + __ffs(tmp), nbits);
+}
 
 void __bitmap_set(unsigned long *map, unsigned int start, int len)
 {
@@ -22,3 +64,23 @@ void __bitmap_set(unsigned long *map, unsigned int start, int len)
 		*p |= mask_to_set;
 	}
 }
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 411b268bc0a2..59db90626042 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -725,10 +725,20 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
 		 * but in practice there's firmware where using that memory leads
 		 * to crashes.
 		 *
-		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
+		 * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if
+		 * supported) are guaranteed to be free.
 		 */
-		if (md->type != EFI_CONVENTIONAL_MEMORY)
+
+		switch (md->type) {
+		case EFI_CONVENTIONAL_MEMORY:
+			break;
+		case EFI_UNACCEPTED_MEMORY:
+			if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+				break;
 			continue;
+		default:
+			continue;
+		}
 
 		if (efi_soft_reserve_enabled() &&
 		    (md->attribute & EFI_MEMORY_SP))
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index d8373d766672..1e3efd0a8e11 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -18,6 +18,7 @@
 #include "../string.h"
 #include "../voffset.h"
 #include <asm/bootparam_utils.h>
+#include <asm/unaccepted_memory.h>
 
 /*
  * WARNING!!
@@ -446,6 +447,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
+
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+	    boot_params->unaccepted_memory) {
+		debug_putstr("Accepting memory... ");
+		accept_memory((phys_addr_t)output,
+			      (phys_addr_t)output + needed_size);
+	}
+
 	__decompress(input_data, input_len, NULL, NULL, output, output_len,
 			NULL, error);
 	parse_elf(output);
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index d8081cde0eed..91db800d5f5e 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -43,3 +43,16 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
 	bitmap_set((unsigned long *)params->unaccepted_memory,
 		   start / PMD_SIZE, (end - start) / PMD_SIZE);
 }
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *unaccepted_memory;
+	unsigned int rs, re;
+
+	unaccepted_memory = (unsigned long *)boot_params->unaccepted_memory;
+	bitmap_for_each_set_region(unaccepted_memory, rs, re,
+				   start / PMD_SIZE, end / PMD_SIZE) {
+		__accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
+		bitmap_clear(unaccepted_memory, rs, re - rs);
+	}
+}
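
The `bitmap_for_each_set_region()` loop in `accept_memory()` above amounts to walking maximal runs of set bits, accepting each run, then clearing it so nothing is accepted twice. A standalone sketch of that control flow (byte-granular bitmap, hypothetical helper names, the real acceptance call stubbed out as a comment):

```c
#include <assert.h>

/* Byte-granular bitmap, illustrative only. */
enum { NBITS = 64 };
static unsigned char bm[NBITS / 8];

static int bit_is_set(unsigned int i) { return (bm[i / 8] >> (i % 8)) & 1; }
static void set_bit(unsigned int i)   { bm[i / 8] |= 1u << (i % 8); }
static void clear_bit(unsigned int i) { bm[i / 8] &= ~(1u << (i % 8)); }

/* Walk [start, end), process each maximal run of set bits, clear it.
 * Returns the number of runs processed. */
static int accept_regions(unsigned int start, unsigned int end)
{
	unsigned int i = start, rs, re, j;
	int regions = 0;

	while (i < end) {
		while (i < end && !bit_is_set(i))
			i++;
		if (i == end)
			break;
		rs = i;
		while (i < end && bit_is_set(i))
			i++;
		re = i;
		/* __accept_memory(rs * PMD_SIZE, re * PMD_SIZE) would go here */
		for (j = rs; j < re; j++)
			clear_bit(j);
		regions++;
	}
	return regions;
}
```

Running it a second time over the same range processes zero regions, which is the property the bitmap-clearing step is there to guarantee.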
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index cbc24040b853..f1f835d3cd78 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -9,4 +9,6 @@ struct boot_params;
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 end);
 
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2022-01-11 11:33 ` [PATCHv2 4/7] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2022-01-11 11:33 ` Kirill A. Shutemov
  2022-01-11 19:10   ` Dave Hansen
  2022-01-11 11:33 ` [PATCHv2 6/7] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

The unaccepted memory bitmap is allocated during the decompression stage
and handed over to the main kernel image via boot_params. The bitmap is
used to track whether memory has been accepted.

Reserve the memory that the bitmap occupies so it cannot be reallocated
for other purposes.
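
The reservation size follows directly from the bitmap's encoding of one bit per 2MiB. A standalone sketch of the arithmetic (constants re-derived here, not taken from kernel headers) confirms the cover letter's figures: one 4k page tracks 64GiB, and 4PiB needs 256MiB:

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap bit per PMD_SIZE (2MiB) of physical address space;
 * x86-64 constants assumed. */
#define PMD_SIZE      (2ULL << 20)
#define BITS_PER_BYTE 8ULL
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Bytes of bitmap needed to cover 'ram_bytes' of physical memory. */
static uint64_t bitmap_size(uint64_t ram_bytes)
{
	return DIV_ROUND_UP(ram_bytes, PMD_SIZE * BITS_PER_BYTE);
}
```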

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/e820.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index bc0657f0deed..dc9048e2d421 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1290,6 +1290,16 @@ void __init e820__memory_setup(void)
 
 	pr_info("BIOS-provided physical RAM map:\n");
 	e820__print_table(who);
+
+	/* Mark unaccepted memory bitmap reserved */
+	if (boot_params.unaccepted_memory) {
+		unsigned long size;
+
+		/* One bit per 2MB */
+		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
+				    PMD_SIZE * BITS_PER_BYTE);
+		memblock_reserve(boot_params.unaccepted_memory, size);
+	}
 }
 
 void __init e820__memblock_setup(void)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCHv2 6/7] x86/mm: Provide helpers for unaccepted memory
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2022-01-11 11:33 ` [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
@ 2022-01-11 11:33 ` Kirill A. Shutemov
  2022-01-11 20:01   ` Dave Hansen
  2022-01-11 11:33 ` [PATCHv2 7/7] x86/tdx: Unaccepted memory support Kirill A. Shutemov
  2022-01-18 21:05 ` [PATCHv2 0/7] Implement support for unaccepted memory Brijesh Singh
  7 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed;

 - maybe_set_page_offline() checks the bitmap and marks a page with
   PageOffline() if memory acceptance is required on the first
   allocation of the page;

 - accept_and_clear_page_offline() accepts memory for the page and clears
   PageOffline().
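
The PMD_ORDER arithmetic these helpers rely on can be sketched in isolation (x86-64 constants assumed; `bits_covered()` is a hypothetical name introduced here for illustration):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PMD_ORDER  (PMD_SHIFT - PAGE_SHIFT)	/* 9: a 2MiB PMD is 2^9 pages */

/* How many bitmap bits (2MiB chunks) a page of the given order spans.
 * Orders below PMD_ORDER fit within a single 2MiB chunk, which is why
 * maybe_set_page_offline() only scans multiple bits for order >=
 * PMD_ORDER. */
static unsigned int bits_covered(unsigned int order)
{
	if (order < PMD_ORDER)
		return 1;
	return 1u << (order - PMD_ORDER);
}
```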

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/unaccepted_memory.c |  3 +-
 arch/x86/include/asm/page.h                  |  5 ++
 arch/x86/include/asm/unaccepted_memory.h     |  3 +
 arch/x86/mm/Makefile                         |  2 +
 arch/x86/mm/unaccepted_memory.c              | 90 ++++++++++++++++++++
 5 files changed, 101 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/mm/unaccepted_memory.c

diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index 91db800d5f5e..b6caca4d3d22 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -20,8 +20,7 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
 
 	/* Immediately accept whole range if it is within a PMD_SIZE block: */
 	if ((start & PMD_MASK) == (end & PMD_MASK)) {
-		npages = (end - start) / PAGE_SIZE;
-		__accept_memory(start, start + npages * PAGE_SIZE);
+		__accept_memory(start, end);
 		return;
 	}
 
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 4d5810c8fab7..1e56d76ca474 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,11 @@
 struct page;
 
 #include <linux/range.h>
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+#include <asm/unaccepted_memory.h>
+#endif
+
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index f1f835d3cd78..8a06ac8fc9e9 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -6,9 +6,12 @@
 #include <linux/types.h>
 
 struct boot_params;
+struct page;
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 end);
 
 void accept_memory(phys_addr_t start, phys_addr_t end);
 
+void maybe_set_page_offline(struct page *page, unsigned int order);
+void accept_and_clear_page_offline(struct page *page, unsigned int order);
 #endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index fe3d3061fc11..e327f83e6bbf 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -60,3 +60,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
+obj-$(CONFIG_UNACCEPTED_MEMORY)	+= unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
new file mode 100644
index 000000000000..984eaead0b11
--- /dev/null
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -0,0 +1,90 @@
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/spinlock.h>
+
+#include <asm/io.h>
+#include <asm/setup.h>
+#include <asm/unaccepted_memory.h>
+
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
+
+static void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *unaccepted_memory;
+	unsigned int rs, re;
+
+	unaccepted_memory = __va(boot_params.unaccepted_memory);
+	bitmap_for_each_set_region(unaccepted_memory, rs, re,
+				   start / PMD_SIZE,
+				   DIV_ROUND_UP(end, PMD_SIZE)) {
+		/* Platform-specific memory-acceptance call goes here */
+		panic("Cannot accept memory");
+		bitmap_clear(unaccepted_memory, rs, re - rs);
+	}
+}
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long flags;
+	if (!boot_params.unaccepted_memory)
+		return;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	__accept_memory(start, end);
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+void __init maybe_set_page_offline(struct page *page, unsigned int order)
+{
+	unsigned long *unaccepted_memory;
+	phys_addr_t addr = page_to_phys(page);
+	unsigned long flags;
+	bool unaccepted = false;
+	unsigned int i;
+
+	if (!boot_params.unaccepted_memory)
+		return;
+
+	unaccepted_memory = __va(boot_params.unaccepted_memory);
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	if (order < PMD_ORDER) {
+		BUG_ON(test_bit(addr / PMD_SIZE, unaccepted_memory));
+		goto out;
+	}
+
+	for (i = 0; i < (1 << (order - PMD_ORDER)); i++) {
+		if (test_bit(addr / PMD_SIZE + i, unaccepted_memory)) {
+			unaccepted = true;
+			break;
+		}
+	}
+
+	/* At least part of the page is unaccepted */
+	if (unaccepted)
+		__SetPageOffline(page);
+out:
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+void accept_and_clear_page_offline(struct page *page, unsigned int order)
+{
+	phys_addr_t addr = round_down(page_to_phys(page), PMD_SIZE);
+	int i;
+
+	/* PageOffline() page on a free list, but no unaccepted memory? Hm. */
+	WARN_ON_ONCE(!boot_params.unaccepted_memory);
+
+	page = pfn_to_page(addr >> PAGE_SHIFT);
+	if (order < PMD_ORDER)
+		order = PMD_ORDER;
+
+	accept_memory(addr, addr + (PAGE_SIZE << order));
+
+	for (i = 0; i < (1 << order); i++) {
+		if (PageOffline(page + i))
+			__ClearPageOffline(page + i);
+	}
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCHv2 7/7] x86/tdx: Unaccepted memory support
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2022-01-11 11:33 ` [PATCHv2 6/7] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
@ 2022-01-11 11:33 ` Kirill A. Shutemov
  2022-01-18 21:05 ` [PATCHv2 0/7] Implement support for unaccepted memory Brijesh Singh
  7 siblings, 0 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-11 11:33 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

All preparation is complete. Hook up TDX-specific code to accept memory.

There are two tdx_accept_memory() implementations: one in the main
kernel and one in the decompressor.

The implementation in the core kernel uses tdx_hcall_request_gpa_type().
That helper is not available in the decompressor, so a self-contained
implementation is added there instead.
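
The decompressor-side call sequence can be sketched with stubbed TDX calls. The counters below stand in for the real MapGPA hypercall and the per-page TDACCEPTPAGE module call, so the control flow can be exercised outside a TDX guest (purely illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Stubs standing in for __tdx_hypercall(MapGPA) and
 * __tdx_module_call(TDACCEPTPAGE); they only count invocations. */
static int map_gpa_calls, accept_page_calls;

static int stub_map_gpa(uint64_t s, uint64_t e)
{
	(void)s; (void)e;
	map_gpa_calls++;
	return 0;
}

static int stub_accept_page(uint64_t gpa)
{
	(void)gpa;
	accept_page_calls++;
	return 0;
}

/* Mirrors the patch's shape: one MapGPA for the whole range, then one
 * TDACCEPTPAGE per 4k page. */
static void tdx_accept_memory_sketch(uint64_t start, uint64_t end)
{
	uint64_t i;

	stub_map_gpa(start, end);
	for (i = 0; i < (end - start) / PAGE_SIZE; i++)
		stub_accept_page(start + i * PAGE_SIZE);
}
```

For a 2MiB range this performs a single MapGPA and 512 page accepts, which is why batching acceptance at PMD granularity matters for boot time.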

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                             |  1 +
 arch/x86/boot/compressed/tdx.c               | 67 ++++++++++++++++++++
 arch/x86/boot/compressed/unaccepted_memory.c |  9 ++-
 arch/x86/include/asm/tdx.h                   |  2 +
 arch/x86/kernel/tdx.c                        |  7 ++
 arch/x86/mm/unaccepted_memory.c              |  6 +-
 6 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2ed1684f399..5d0f99bd3538 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MCE
 	select X86_MEM_ENCRYPT
+	select UNACCEPTED_MEMORY
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 50c8145bd0f3..587e6d948953 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -5,12 +5,54 @@
 
 #include "../cpuflags.h"
 #include "../string.h"
+#include "error.h"
 
+#include <asm/page_types.h>
+
+#define TDX_HYPERCALL_STANDARD	0
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the VMM. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_hypercall_output {
+	u64 r10;
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
+
 static bool tdx_guest_detected;
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
+		    u64 r15, struct tdx_hypercall_output *out);
+
 void early_tdx_detect(void)
 {
 	u32 eax, sig[3];
@@ -28,3 +70,28 @@ bool early_is_tdx_guest(void)
 {
 	return tdx_guest_detected;
 }
+
+#define TDACCEPTPAGE		6
+#define TDVMCALL_MAP_GPA	0x10001
+
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct tdx_hypercall_output outl = {0};
+	int i;
+
+	if (__tdx_hypercall(TDX_HYPERCALL_STANDARD, TDVMCALL_MAP_GPA,
+			    start, end, 0, 0, &outl)) {
+		error("Cannot accept memory: MapGPA failed\n");
+	}
+
+	/*
+	 * For shared->private conversion, accept the page using TDACCEPTPAGE
+	 * TDX module call.
+	 */
+	for (i = 0; i < (end - start) / PAGE_SIZE; i++) {
+		if (__tdx_module_call(TDACCEPTPAGE, start + i * PAGE_SIZE,
+				      0, 0, 0, NULL)) {
+			error("Cannot accept memory: page accept failed\n");
+		}
+	}
+}
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index b6caca4d3d22..c23526c25e50 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -2,11 +2,15 @@
 
 #include "error.h"
 #include "misc.h"
+#include "tdx.h"
 
 static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
-	error("Cannot accept memory");
+	if (early_is_tdx_guest())
+		tdx_accept_memory(start, end);
+	else
+		error("Cannot accept memory");
 }
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
@@ -18,6 +22,9 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
 	 * *marked* as unaccepted.
 	 */
 
+	/* __accept_memory() needs to know if kernel runs in TDX environment */
+	early_tdx_detect();
+
 	/* Immediately accept whole range if it is within a PMD_SIZE block: */
 	if ((start & PMD_MASK) == (end & PMD_MASK)) {
 		__accept_memory(start, end);
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6d901cb6d607..fbbe4644cc7b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -90,6 +90,8 @@ phys_addr_t tdx_shared_mask(void);
 int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end,
 			       enum tdx_map_type map_type);
 
+extern void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 0f8f7285c05b..a0ff720425d8 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -162,6 +162,13 @@ int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end,
 	return 0;
 }
 
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	if (tdx_hcall_request_gpa_type(start, end, TDX_MAP_PRIVATE)) {
+		panic("Accepting memory failed\n");
+	}
+}
+
 static u64 __cpuidle _tdx_halt(const bool irq_disabled, const bool do_sti)
 {
 	/*
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 984eaead0b11..9f468d58d51f 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -5,6 +5,7 @@
 
 #include <asm/io.h>
 #include <asm/setup.h>
+#include <asm/tdx.h>
 #include <asm/unaccepted_memory.h>
 
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -21,7 +22,10 @@ static void __accept_memory(phys_addr_t start, phys_addr_t end)
 				   start / PMD_SIZE,
 				   DIV_ROUND_UP(end, PMD_SIZE)) {
 		/* Platform-specific memory-acceptance call goes here */
-		panic("Cannot accept memory");
+		if (cc_platform_has(CC_ATTR_GUEST_TDX))
+			tdx_accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
+		else
+			panic("Cannot accept memory");
 		bitmap_clear(unaccepted_memory, rs, re - rs);
 	}
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCHv2 3/7] efi/x86: Implement support for unaccepted memory
  2022-01-11 11:33 ` [PATCHv2 3/7] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-01-11 17:17   ` Dave Hansen
  2022-01-12 19:29     ` Kirill A. Shutemov
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2022-01-11 17:17 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On 1/11/22 03:33, Kirill A. Shutemov wrote:
...
> +void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> +{
> +	/*
> +	 * The accepted memory bitmap only works at PMD_SIZE granularity.
> +	 * If a request comes in to mark memory as unaccepted which is not
> +	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
> +	 * *marked* as unaccepted.
> +	 */
> +
> +	/* Immediately accept whole range if it is within a PMD_SIZE block: */
> +	if ((start & PMD_MASK) == (end & PMD_MASK)) {
> +		npages = (end - start) / PAGE_SIZE;
> +		__accept_memory(start, start + npages * PAGE_SIZE);
> +		return;
> +	}

I still don't quite like how this turned out.  It's still a bit unclear 
to the reader that this has covered all the corner cases.  I think this 
needs a better comment:

	/*
	 * Handle <PMD_SIZE blocks that do not end at a PMD boundary.
	 *
	 * Immediately accept the whole block.  This handles the case
	 * where the below round_{up,down}() would "lose" a small,
	 * <PMD_SIZE block.
	 */
	if ((start & PMD_MASK) == (end & PMD_MASK)) {
		...
		return;
	}

	/*
	 * There is at least one more block to accept.  Both 'start'
	 * and 'end' may not be PMD-aligned.
	 */


> +	/* Immediately accept a <PMD_SIZE piece at the start: */
> +	if (start & ~PMD_MASK) {
> +		__accept_memory(start, round_up(start, PMD_SIZE));
> +		start = round_up(start, PMD_SIZE);
> +	}
> +
> +	/* Immediately accept a <PMD_SIZE piece at the end: */
> +	if (end & ~PMD_MASK) {
> +		__accept_memory(round_down(end, PMD_SIZE), end);
> +		end = round_down(end, PMD_SIZE);
> +	}

	/*
	 * 'start' and 'end' are now both PMD-aligned.
	 * Record the range as being unaccepted:
	 */

> +	if (start == end)
> +		return;

Does bitmap_set() not accept zero-sized 'len' arguments?

> +	bitmap_set((unsigned long *)params->unaccepted_memory,
> +		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> +}

The code you have there is _precise_.  It will never eagerly accept any 
area that _can_ be represented in the bitmap.  But, that's kinda hard to 
describe.  Maybe we should be a bit more sloppy about accepting things 
up front to make it easier to describe:

	/*
	 * Accept small regions that might not be
	 * able to be represented in the bitmap:
	 */
	if (end - start < PMD_SIZE*2) {
		npages = (end - start) / PAGE_SIZE;
		__accept_memory(start, start + npages * PAGE_SIZE);
		return;
	}

	/*
	 * No matter how the start and end are aligned, at
	 * least one unaccepted PMD_SIZE area will remain.
	 */

	... now do the start/end rounding

That has the downside of accepting a few things that it doesn't *HAVE* 
to accept.  But, its behavior is very easy to describe.
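
For concreteness, the rounding both variants rely on can be checked numerically (a standalone sketch with PMD_SIZE = 2MiB; the macros are re-derived here, not taken from the kernel):

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (2ULL << 20)
#define PMD_MASK (~(PMD_SIZE - 1))

/* True when the range lies within one 2MiB block -- the early-accept
 * case in the quoted code. */
static int same_pmd_block(uint64_t s, uint64_t e)
{
	return (s & PMD_MASK) == (e & PMD_MASK);
}

static uint64_t round_up_pmd(uint64_t x)   { return (x + PMD_SIZE - 1) & PMD_MASK; }
static uint64_t round_down_pmd(uint64_t x) { return x & PMD_MASK; }
```

Note in particular that a range spanning only the tail of one PMD and the head of the next rounds to `start == end`, which is the case the `if (start == end)` check guards against.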

> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> new file mode 100644
> index 000000000000..cbc24040b853
> --- /dev/null
> +++ b/arch/x86/include/asm/unaccepted_memory.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2020 Intel Corporation */
> +#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
> +#define _ASM_X86_UNACCEPTED_MEMORY_H
> +
> +#include <linux/types.h>
> +
> +struct boot_params;
> +
> +void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
> +
> +#endif
> diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> index b25d3f82c2f3..16bc686a198d 100644
> --- a/arch/x86/include/uapi/asm/bootparam.h
> +++ b/arch/x86/include/uapi/asm/bootparam.h
> @@ -217,7 +217,8 @@ struct boot_params {
>   	struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */
>   	__u8  _pad8[48];				/* 0xcd0 */
>   	struct edd_info eddbuf[EDDMAXNR];		/* 0xd00 */
> -	__u8  _pad9[276];				/* 0xeec */
> +	__u64 unaccepted_memory;			/* 0xeec */
> +	__u8  _pad9[268];				/* 0xef4 */
>   } __attribute__((packed));
>   
>   /**
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 2c3dac5ecb36..36c1bf33f112 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -243,6 +243,20 @@ config EFI_DISABLE_PCI_DMA
>   	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
>   	  may be used to override this option.
>   
> +config UNACCEPTED_MEMORY
> +	bool
> +	depends on EFI_STUB
> +	help
> +	   Some Virtual Machine platforms, such as Intel TDX, introduce
> +	   the concept of memory acceptance, requiring memory to be accepted
> +	   before it can be used by the guest. This protects against a class of
> +	   attacks by the virtual machine platform.

	Some Virtual Machine platforms, such as Intel TDX, require
	some memory to be "accepted" by the guest before it can be used.
	This requirement protects against a class of attacks by the
	virtual machine platform.

Can we make this "class of attacks" a bit more concrete?  Maybe:

	This mechanism helps prevent malicious hosts from making changes
	to guest memory.

??

> +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> +	   This option adds support for unaccepted memory and makes such memory
> +	   usable by kernel.
> +
>   endmenu
>   
>   config EFI_EMBEDDED_FIRMWARE
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index ae79c3300129..abe862c381b6 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -740,6 +740,7 @@ static __initdata char memory_type_name[][13] = {
>   	"MMIO Port",
>   	"PAL Code",
>   	"Persistent",
> +	"Unaccepted",
>   };
>   
>   char * __init efi_md_typeattr_format(char *buf, size_t size,
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index a0b946182b5e..346b12d6f1b2 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -9,12 +9,14 @@
>   #include <linux/efi.h>
>   #include <linux/pci.h>
>   #include <linux/stddef.h>
> +#include <linux/bitmap.h>
>   
>   #include <asm/efi.h>
>   #include <asm/e820/types.h>
>   #include <asm/setup.h>
>   #include <asm/desc.h>
>   #include <asm/boot.h>
> +#include <asm/unaccepted_memory.h>
>   
>   #include "efistub.h"
>   
> @@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>   			e820_type = E820_TYPE_PMEM;
>   			break;
>   
> +		case EFI_UNACCEPTED_MEMORY:
> +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +				continue;
> +			e820_type = E820_TYPE_RAM;
> +			mark_unaccepted(params, d->phys_addr,
> +					d->phys_addr + PAGE_SIZE * d->num_pages);
> +			break;
>   		default:
>   			continue;
>   		}
> @@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
>   {
>   	efi_status_t status;
>   	__u32 nr_desc;
> +	bool unaccepted_memory_present = false;
> +	u64 max_addr = 0;
> +	int i;
>   
>   	status = efi_get_memory_map(map);
>   	if (status != EFI_SUCCESS)
> @@ -589,9 +601,55 @@ static efi_status_t allocate_e820(struct boot_params *params,
>   		if (status != EFI_SUCCESS)
>   			goto out;
>   	}
> +
> +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +		goto out;
> +
> +	/* Check if there's any unaccepted memory and find the max address */
> +	for (i = 0; i < nr_desc; i++) {
> +		efi_memory_desc_t *d;
> +
> +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> +		if (d->type == EFI_UNACCEPTED_MEMORY)
> +			unaccepted_memory_present = true;
> +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> +	}
> +
> +	/*
> +	 * If unaccepted memory present allocate a bitmap to track what memory

			       ^ is

> +	 * has to be accepted before access.
> +	 *
> +	 * One bit in the bitmap represents 2MiB in the address space: one 4k
> +	 * page is enough to track 64GiB or physical address space.

That's a bit awkward and needs a "or->of".  Perhaps:

	* One bit in the bitmap represents 2MiB in the address space:
	* A 4k bitmap can track 64GiB of physical address space.

> +	 * In the worst case scenario -- a huge hole in the middle of the
> +	 * address space -- It needs 256MiB to handle 4PiB of the address
> +	 * space.
> +	 *
> +	 * TODO: handle situation if params->unaccepted_memory has already set.
> +	 * It's required to deal with kexec.

What happens today with kexec() since it's not dealt with?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap
  2022-01-11 11:33 ` [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
@ 2022-01-11 19:10   ` Dave Hansen
  2022-01-12 19:43     ` Kirill A. Shutemov
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2022-01-11 19:10 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On 1/11/22 03:33, Kirill A. Shutemov wrote:
> Unaccepted memory bitmap is allocated during decompression stage and
> handed over to main kernel image via boot_params. The bitmap is used to
> track if memory has been accepted.
> 
> Reserve unaccepted memory bitmap has to prevent reallocating memory for
> other means.

I'm having a hard time parsing that changelog, especially the second 
paragraph.  Could you give it another shot?

> +	/* Mark unaccepted memory bitmap reserved */
> +	if (boot_params.unaccepted_memory) {
> +		unsigned long size;
> +
> +		/* One bit per 2MB */
> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> +				    PMD_SIZE * BITS_PER_BYTE);
> +		memblock_reserve(boot_params.unaccepted_memory, size);
> +	}

Is it OK that the size of the bitmap is inferred from 
e820__end_of_ram_pfn()?  Is this OK in the presence of mem= and other 
things that muck with the e820?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-11 11:33 ` [PATCHv2 1/7] mm: Add " Kirill A. Shutemov
@ 2022-01-11 19:46   ` Dave Hansen
  2022-01-12 11:31     ` David Hildenbrand
  2022-01-12 18:30     ` Kirill A. Shutemov
  0 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2022-01-11 19:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

> diff --git a/mm/memblock.c b/mm/memblock.c
> index 1018e50566f3..6dfa594192de 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>   		 */
>   		kmemleak_alloc_phys(found, size, 0, 0);
>   
> +	accept_memory(found, found + size);
>   	return found;
>   }

This could use a comment.

Looking at this, I also have to wonder if accept_memory() is a bit too 
generic.  Should it perhaps be: cc_accept_memory() or 
cc_guest_accept_memory()?

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..5707b4b5f774 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
>   	unsigned int max_order;
>   	struct page *buddy;
>   	bool to_tail;
> +	bool offline = PageOffline(page);
>   
>   	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
>   
> @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page,
>   			clear_page_guard(zone, buddy, order, migratetype);
>   		else
>   			del_page_from_free_list(buddy, zone, order);
> +
> +		if (PageOffline(buddy))
> +			offline = true;
> +
>   		combined_pfn = buddy_pfn & pfn;
>   		page = page + (combined_pfn - pfn);
>   		pfn = combined_pfn;
> @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page,
>   done_merging:
>   	set_buddy_order(page, order);
>   
> +	if (offline)
> +		__SetPageOffline(page);
> +
>   	if (fpi_flags & FPI_TO_TAIL)
>   		to_tail = true;
>   	else if (is_shuffle_order(order))

This is touching some pretty hot code paths.  You mention both that 
accepting memory is slow and expensive, yet you're doing it in the core 
allocator.

That needs at least some discussion in the changelog.

> @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page,
>   static inline bool page_expected_state(struct page *page,
>   					unsigned long check_flags)
>   {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> +	    !PageOffline(page))
>   		return false;

Looking at stuff like this, I can't help but think that a:

	#define PageOffline PageUnaccepted

and some other renaming would be a fine idea.  I get that the Offline 
bit can be reused, but I'm not sure that the "Offline" *naming* should 
be reused.  What you're doing here is logically distinct from existing 
offlining.

>   	if (unlikely((unsigned long)page->mapping |
> @@ -1734,6 +1743,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
>   {
>   	if (early_page_uninitialised(pfn))
>   		return;
> +
> +	maybe_set_page_offline(page, order);
>   	__free_pages_core(page, order);
>   }
>   
> @@ -1823,10 +1834,12 @@ static void __init deferred_free_range(unsigned long pfn,
>   	if (nr_pages == pageblock_nr_pages &&
>   	    (pfn & (pageblock_nr_pages - 1)) == 0) {
>   		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +		maybe_set_page_offline(page, pageblock_order);
>   		__free_pages_core(page, pageblock_order);
>   		return;
>   	}
>   
> +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
>   	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>   		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>   			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -2297,6 +2310,9 @@ static inline void expand(struct zone *zone, struct page *page,
>   		if (set_page_guard(zone, &page[size], high, migratetype))
>   			continue;
>   
> +		if (PageOffline(page))
> +			__SetPageOffline(&page[size]);

Yeah, this is really begging for comments.  Please add some.

>   		add_to_free_list(&page[size], zone, high, migratetype);
>   		set_buddy_order(&page[size], high);
>   	}
> @@ -2393,6 +2409,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>   	 */
>   	kernel_unpoison_pages(page, 1 << order);
>   
> +	if (PageOffline(page))
> +		accept_and_clear_page_offline(page, order);
> +
>   	/*
>   	 * As memory initialization might be integrated into KASAN,
>   	 * kasan_alloc_pages and kernel_init_free_pages must be

I guess once there are no more PageOffline() pages in the allocator, the 
only impact from these patches will be a bunch of conditional branches 
from the "if (PageOffline(page))" that always have the same result.  The 
branch predictors should do a good job with that.

*BUT*, that overhead is going to be universally inflicted on all users 
on x86, even those without TDX.  I guess the compiler will save non-x86 
users because they'll have an empty stub for 
accept_and_clear_page_offline() which the compiler will optimize away.

It sure would be nice to have some changelog material about why this is 
OK, though.  This is especially true since there's a global spinlock 
hidden in accept_and_clear_page_offline() wrapping a slow and "costly" 
operation.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHv2 6/7] x86/mm: Provide helpers for unaccepted memory
  2022-01-11 11:33 ` [PATCHv2 6/7] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
@ 2022-01-11 20:01   ` Dave Hansen
  2022-01-12 19:43     ` Kirill A. Shutemov
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2022-01-11 20:01 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On 1/11/22 03:33, Kirill A. Shutemov wrote:
> Core-mm requires few helpers to support unaccepted memory:
> 
>   - accept_memory() checks the range of addresses against the bitmap and
>     accept memory if needed;
> 
>   - maybe_set_page_offline() checks the bitmap and marks a page with
>     PageOffline() if memory acceptance required on the first
>     allocation of the page.
> 
>   - accept_and_clear_page_offline() accepts memory for the page and clears
>     PageOffline().
> 
...
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned long flags;
> +	if (!boot_params.unaccepted_memory)
> +		return;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	__accept_memory(start, end);
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}

Not a big deal, but please cc me on all the patches in the series.  This 
is called from the core mm patches which I wasn't cc'd on.

This also isn't obvious, but this introduces a new, global lock into the 
fast path of the page allocator and holds it for extended periods of 
time.  It won't be taken any more once all memory is accepted, but you 
can sure bet that it will be noticeable until that happens.

*PLEASE* document this.  It needs changelog and probably code comments.
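
For readers following along, the shape being discussed can be sketched in plain userspace C. This is a toy model, not kernel code: the bitmap layout (one bit per 2MiB) mirrors the patch, the global spinlock is reduced to a comment, and platform_accept() is a hypothetical stand-in for the TDX/SEV-SNP acceptance protocol:

```c
#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE (2ULL << 20)	/* one bitmap bit covers 2MiB */

/* Toy bitmap: 64 bits, i.e. covers 128MiB of address space. */
static uint64_t unaccepted_bitmap = ~0ULL;	/* everything unaccepted */

/* Hypothetical stand-in for the slow, platform-specific accept protocol. */
static void platform_accept(uint64_t phys)
{
	(void)phys;
}

/*
 * Accept [start, end).  In the patch the whole walk -- including the
 * costly platform_accept() calls -- runs under a single global spinlock
 * with interrupts disabled, which is the contention point flagged above.
 */
static void accept_memory(uint64_t start, uint64_t end)
{
	for (uint64_t bit = start / PMD_SIZE; bit < end / PMD_SIZE; bit++) {
		if (unaccepted_bitmap & (1ULL << bit)) {
			platform_accept(bit * PMD_SIZE);	/* "costly" step */
			unaccepted_bitmap &= ~(1ULL << bit);	/* never re-done */
		}
	}
}

static bool range_accepted(uint64_t start, uint64_t end)
{
	for (uint64_t bit = start / PMD_SIZE; bit < end / PMD_SIZE; bit++)
		if (unaccepted_bitmap & (1ULL << bit))
			return false;
	return true;
}
```

Once every bit is clear, accept_memory() degenerates to a bitmap scan with no platform calls, which is why the cost is transient.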



* Re: [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-11 19:46   ` Dave Hansen
@ 2022-01-12 11:31     ` David Hildenbrand
  2022-01-12 19:15       ` Kirill A. Shutemov
  2022-01-12 18:30     ` Kirill A. Shutemov
  1 sibling, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2022-01-12 11:31 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel


> 
> Looking at stuff like this, I can't help but think that a:
> 
> 	#define PageOffline PageUnaccepted
> 
> and some other renaming would be a fine idea.  I get that the Offline 
> bit can be reused, but I'm not sure that the "Offline" *naming* should 
> be reused.  What you're doing here is logically distinct from existing 
> offlining.

Yes, or using a new pagetype bit to make the distinction clearer.
Especially the function names like maybe_set_page_offline() et. Al are
confusing IMHO. They are all about accepting unaccepted memory ... and
should express that.

I assume PageOffline() will be set only on the first sub-page of a
high-order PageBuddy() page, correct?

Then we'll have to monitor all PageOffline() users such that they can
actually deal with PageBuddy() pages spanning *multiple* base pages for
a PageBuddy() page. For now it's clear that if a page is PageOffline(),
it cannot be PageBuddy() and cannot span more than one base page.

E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on
individual base pages.

-- 
Thanks,

David / dhildenb



* Re: [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-11 19:46   ` Dave Hansen
  2022-01-12 11:31     ` David Hildenbrand
@ 2022-01-12 18:30     ` Kirill A. Shutemov
  2022-01-12 18:40       ` Dave Hansen
  1 sibling, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-12 18:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote:
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 1018e50566f3..6dfa594192de 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >   		 */
> >   		kmemleak_alloc_phys(found, size, 0, 0);
> > +	accept_memory(found, found + size);
> >   	return found;
> >   }
> 
> This could use a comment.

How about this:

	/*
	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
	 * require memory to be accepted before it can be used by the
	 * guest.
	 *
	 * Accept the memory of the allocated buffer.
	 */
> 
> Looking at this, I also have to wonder if accept_memory() is a bit too
> generic.  Should it perhaps be: cc_accept_memory() or
> cc_guest_accept_memory()?

I'll rename accept_memory() to cc_accept_memory() and
accept_and_clear_page_offline() to cc_accept_and_clear_page_offline().

> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c5952749ad40..5707b4b5f774 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
> >   	unsigned int max_order;
> >   	struct page *buddy;
> >   	bool to_tail;
> > +	bool offline = PageOffline(page);
> >   	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
> > @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page,
> >   			clear_page_guard(zone, buddy, order, migratetype);
> >   		else
> >   			del_page_from_free_list(buddy, zone, order);
> > +
> > +		if (PageOffline(buddy))
> > +			offline = true;
> > +
> >   		combined_pfn = buddy_pfn & pfn;
> >   		page = page + (combined_pfn - pfn);
> >   		pfn = combined_pfn;
> > @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page,
> >   done_merging:
> >   	set_buddy_order(page, order);
> > +	if (offline)
> > +		__SetPageOffline(page);
> > +

I'll add

	/* Mark page PageOffline() if any merged page was PageOffline() */

above the 'if'.

> >   	if (fpi_flags & FPI_TO_TAIL)
> >   		to_tail = true;
> >   	else if (is_shuffle_order(order))
> 
> This is touching some pretty hot code paths.  You mention both that
> accepting memory is slow and expensive, yet you're doing it in the core
> allocator.
> 
> That needs at least some discussion in the changelog.

That is just the page type transfer on page merging. What expense do you see here?
The cachelines with both struct pages are hot already.

> > @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page,
> >   static inline bool page_expected_state(struct page *page,
> >   					unsigned long check_flags)
> >   {
> > -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > +	    !PageOffline(page))
> >   		return false;
> 
> Looking at stuff like this, I can't help but think that a:
> 
> 	#define PageOffline PageUnaccepted
> 
> and some other renaming would be a fine idea.  I get that the Offline bit
> can be reused, but I'm not sure that the "Offline" *naming* should be
> reused.  What you're doing here is logically distinct from existing
> offlining.

I find the Offline name fitting. In both cases the page is not accessible
without additional preparation.

Why do you want to multiply entities?

> >   	if (unlikely((unsigned long)page->mapping |
> > @@ -1734,6 +1743,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
> >   {
> >   	if (early_page_uninitialised(pfn))
> >   		return;
> > +
> > +	maybe_set_page_offline(page, order);
> >   	__free_pages_core(page, order);
> >   }
> > @@ -1823,10 +1834,12 @@ static void __init deferred_free_range(unsigned long pfn,
> >   	if (nr_pages == pageblock_nr_pages &&
> >   	    (pfn & (pageblock_nr_pages - 1)) == 0) {
> >   		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > +		maybe_set_page_offline(page, pageblock_order);
> >   		__free_pages_core(page, pageblock_order);
> >   		return;
> >   	}
> > +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
> >   	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> >   		if ((pfn & (pageblock_nr_pages - 1)) == 0)
> >   			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > @@ -2297,6 +2310,9 @@ static inline void expand(struct zone *zone, struct page *page,
> >   		if (set_page_guard(zone, &page[size], high, migratetype))
> >   			continue;
> > +		if (PageOffline(page))
> > +			__SetPageOffline(&page[size]);
> 
> Yeah, this is really begging for comments.  Please add some.

I'll add
		/* Transfer PageOffline() to newly split pages */
> 
> >   		add_to_free_list(&page[size], zone, high, migratetype);
> >   		set_buddy_order(&page[size], high);
> >   	}
> > @@ -2393,6 +2409,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >   	 */
> >   	kernel_unpoison_pages(page, 1 << order);
> > +	if (PageOffline(page))
> > +		accept_and_clear_page_offline(page, order);
> > +
> >   	/*
> >   	 * As memory initialization might be integrated into KASAN,
> >   	 * kasan_alloc_pages and kernel_init_free_pages must be
> 
> I guess once there are no more PageOffline() pages in the allocator, the
> only impact from these patches will be a bunch of conditional branches from
> the "if (PageOffline(page))" that always have the same result.  The branch
> predictors should do a good job with that.
> 
> *BUT*, that overhead is going to be universally inflicted on all users on
> x86, even those without TDX.  I guess the compiler will save non-x86 users
> because they'll have an empty stub for accept_and_clear_page_offline() which
> the compiler will optimize away.
> 
> It sure would be nice to have some changelog material about why this is OK,
> though.  This is especially true since there's a global spinlock hidden in
> accept_and_clear_page_offline() wrapping a slow and "costly" operation.

Okay, I will come up with an explanation in the commit message.
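
As an aside, the invariant the PageOffline() hunks in __free_one_page() and expand() maintain can be modeled compactly. This is a toy model (the struct and function names are invented, not kernel API): "unaccepted" is sticky across buddy operations -- merging ORs the flag, splitting copies it to both halves.

```c
#include <stdbool.h>

/*
 * Toy model of buddy-allocator flag propagation: merging two buddies
 * ORs their "offline" (unaccepted) flags into the merged block;
 * splitting copies the parent's flag to both halves.
 */
struct block {
	bool offline;
};

/* Mirrors the __free_one_page() hunk: any offline buddy taints the result. */
static struct block merge(struct block a, struct block b)
{
	return (struct block){ .offline = a.offline || b.offline };
}

/* Mirrors the expand() hunk: both split-off halves inherit the flag. */
static void split(struct block parent, struct block *lo, struct block *hi)
{
	lo->offline = parent.offline;
	hi->offline = parent.offline;
}
```

The actual acceptance (and flag clearing) then happens once, at allocation time, in post_alloc_hook().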

-- 
 Kirill A. Shutemov


* Re: [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-12 18:30     ` Kirill A. Shutemov
@ 2022-01-12 18:40       ` Dave Hansen
  2022-01-13  7:42         ` Mike Rapoport
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2022-01-12 18:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On 1/12/22 10:30, Kirill A. Shutemov wrote:
> On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote:
>>> diff --git a/mm/memblock.c b/mm/memblock.c
>>> index 1018e50566f3..6dfa594192de 100644
>>> --- a/mm/memblock.c
>>> +++ b/mm/memblock.c
>>> @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>>>    		 */
>>>    		kmemleak_alloc_phys(found, size, 0, 0);
>>> +	accept_memory(found, found + size);
>>>    	return found;
>>>    }
>>
>> This could use a comment.
> 
> How about this:
> 
> 	/*
> 	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> 	 * require memory to be accepted before it can be used by the
> 	 * guest.
> 	 *
> 	 * Accept the memory of the allocated buffer.
> 	 */

I think a one-liner that might cue the reader to go look at 
accept_memory() itself would be fine.  Maybe:

	/* Make the memblock usable when running in picky VM guests: */

That implies that the memory isn't usable without doing this and also 
points out that it's related to running in a guest.

>> Looking at this, I also have to wonder if accept_memory() is a bit too
>> generic.  Should it perhaps be: cc_accept_memory() or
>> cc_guest_accept_memory()?
> 
> I'll rename accept_memory() to cc_accept_memory() and
> accept_and_clear_page_offline() to cc_accept_and_clear_page_offline().
> 
>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c5952749ad40..5707b4b5f774 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
>>>    	unsigned int max_order;
>>>    	struct page *buddy;
>>>    	bool to_tail;
>>> +	bool offline = PageOffline(page);
>>>    	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
>>> @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page,
>>>    			clear_page_guard(zone, buddy, order, migratetype);
>>>    		else
>>>    			del_page_from_free_list(buddy, zone, order);
>>> +
>>> +		if (PageOffline(buddy))
>>> +			offline = true;
>>> +
>>>    		combined_pfn = buddy_pfn & pfn;
>>>    		page = page + (combined_pfn - pfn);
>>>    		pfn = combined_pfn;
>>> @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page,
>>>    done_merging:
>>>    	set_buddy_order(page, order);
>>> +	if (offline)
>>> +		__SetPageOffline(page);
>>> +
> 
> I'll add
> 
> 	/* Mark page PageOffline() if any merged page was PageOffline() */
> 
> above the 'if'.
> 
>>>    	if (fpi_flags & FPI_TO_TAIL)
>>>    		to_tail = true;
>>>    	else if (is_shuffle_order(order))
>>
>> This is touching some pretty hot code paths.  You mention both that
>> accepting memory is slow and expensive, yet you're doing it in the core
>> allocator.
>>
>> That needs at least some discussion in the changelog.
> 
> That is just the page type transfer on page merging. What expense do you see here?
> The cachelines with both struct pages are hot already.

I meant that comment generically rather than at this specific hunk.

Just in general, I think this series needs to acknowledge that it is 
touching very core parts of the allocator and might make page allocation 
*MASSIVELY* slower, albeit temporarily.

>>> @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page,
>>>    static inline bool page_expected_state(struct page *page,
>>>    					unsigned long check_flags)
>>>    {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageOffline(page))
>>>    		return false;
>>
>> Looking at stuff like this, I can't help but think that a:
>>
>> 	#define PageOffline PageUnaccepted
>>
>> and some other renaming would be a fine idea.  I get that the Offline bit
>> can be reused, but I'm not sure that the "Offline" *naming* should be
>> reused.  What you're doing here is logically distinct from existing
>> offlining.
> 
> I find the Offline name fitting. In both cases the page is not accessible
> without additional preparation.
> 
> Why do you want to multiply entities?

The name wouldn't be bad *if* there was no other use of "Offline".  But, 
logically, your use of "Offline" and the existing use of "Offline" are 
different things.  They are totally orthogonal areas of the code.  They 
should have different names.

Again, I'm fine with using the same _bit_ in page->flags.  But, the two 
logical uses need two different names.


* Re: [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-12 11:31     ` David Hildenbrand
@ 2022-01-12 19:15       ` Kirill A. Shutemov
  2022-01-14 13:22         ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-12 19:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Jan 12, 2022 at 12:31:10PM +0100, David Hildenbrand wrote:
> 
> > 
> > Looking at stuff like this, I can't help but think that a:
> > 
> > 	#define PageOffline PageUnaccepted
> > 
> > and some other renaming would be a fine idea.  I get that the Offline 
> > bit can be reused, but I'm not sure that the "Offline" *naming* should 
> > be reused.  What you're doing here is logically distinct from existing 
> > offlining.
> 
> Yes, or using a new pagetype bit to make the distinction clearer.
Especially the function names like maybe_set_page_offline() et al. are
> confusing IMHO. They are all about accepting unaccepted memory ... and
> should express that.

"Unaccepted" is UEFI terminology and I'm not sure we want to expose
core-mm to it. Power/S390/ARM may have a different name for the same
concept. Offline/online is neutral terminology, familiar to MM developers.

What if I change accept->online in function names and document the meaning
properly?

> I assume PageOffline() will be set only on the first sub-page of a
> high-order PageBuddy() page, correct?
> 
> Then we'll have to monitor all PageOffline() users such that they can
> actually deal with PageBuddy() pages spanning *multiple* base pages for
> a PageBuddy() page. For now it's clear that if a page is PageOffline(),
> it cannot be PageBuddy() and cannot span more than one base page.

> E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on
> individual base pages.

Right, pages that are offline from the hotplug POV are never on the page
allocator's free lists, so it can never step on them.

-- 
 Kirill A. Shutemov


* Re: [PATCHv2 3/7] efi/x86: Implement support for unaccepted memory
  2022-01-11 17:17   ` Dave Hansen
@ 2022-01-12 19:29     ` Kirill A. Shutemov
  2022-01-12 19:35       ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-12 19:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On Tue, Jan 11, 2022 at 09:17:19AM -0800, Dave Hansen wrote:
> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> ...
> > +void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> > +{
> > +	/*
> > +	 * The accepted memory bitmap only works at PMD_SIZE granularity.
> > +	 * If a request comes in to mark memory as unaccepted which is not
> > +	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
> > +	 * *marked* as unaccepted.
> > +	 */
> > +
> > +	/* Immediately accept whole range if it is within a PMD_SIZE block: */
> > +	if ((start & PMD_MASK) == (end & PMD_MASK)) {
> > +		npages = (end - start) / PAGE_SIZE;
> > +		__accept_memory(start, start + npages * PAGE_SIZE);
> > +		return;
> > +	}
> 
> I still don't quite like how this turned out.  It's still a bit unclear to
> the reader that this has covered all the corner cases.  I think this needs a
> better comment:
> 
> 	/*
> 	 * Handle <PMD_SIZE blocks that do not end at a PMD boundary.
> 	 *
> 	 * Immediately accept the whole block.  This handles the case
> 	 * where the below round_{up,down}() would "lose" a small,
> 	 * <PMD_SIZE block.
> 	 */
> 	if ((start & PMD_MASK) == (end & PMD_MASK)) {
> 		...
> 		return;
> 	}
> 
> 	/*
> 	 * There is at least one more block to accept.  Both 'start'
> 	 * and 'end' may not be PMD-aligned.
> 	 */

Okay, looks better. Thanks.

> > +	/* Immediately accept a <PMD_SIZE piece at the start: */
> > +	if (start & ~PMD_MASK) {
> > +		__accept_memory(start, round_up(start, PMD_SIZE));
> > +		start = round_up(start, PMD_SIZE);
> > +	}
> > +
> > +	/* Immediately accept a <PMD_SIZE piece at the end: */
> > +	if (end & ~PMD_MASK) {
> > +		__accept_memory(round_down(end, PMD_SIZE), end);
> > +		end = round_down(end, PMD_SIZE);
> > +	}
> 
> 	/*
> 	 * 'start' and 'end' are now both PMD-aligned.
> 	 * Record the range as being unaccepted:
> 	 */

Okay.

> > +	if (start == end)
> > +		return;
> 
> Does bitmap_set() not accept zero-sized 'len' arguments?

Looks like it does. Will drop this.

> > +	bitmap_set((unsigned long *)params->unaccepted_memory,
> > +		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> > +}
> 
> The code you have there is _precise_.  It will never eagerly accept any area
> that _can_ be represented in the bitmap.  But, that's kinda hard to
> describe.  Maybe we should be a bit more sloppy about accepting things up
> front to make it easier to describe:
> 
> 	/*
> 	 * Accept small regions that might not be
> 	 * able to be represented in the bitmap:
> 	 */
> 	if (end - start < PMD_SIZE*2) {
> 		npages = (end - start) / PAGE_SIZE;
> 		__accept_memory(start, start + npages * PAGE_SIZE);
> 		return;
> 	}
> 
> 	/*
> 	 * No matter how the start and end are aligned, at
> 	 * least one unaccepted PMD_SIZE area will remain.
> 	 */
> 
> 	... now do the start/end rounding
> 
> That has the downside of accepting a few things that it doesn't *HAVE* to
> accept.  But, its behavior is very easy to describe.

Hm. Okay. I will give it a try. I like how it is now, but maybe it will be
better.
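
To make the comparison concrete, here is a userspace sketch of the "sloppy" variant suggested above. The struct and the return-value plumbing are invented purely for testability; the kernel function records the PMD-aligned middle via bitmap_set() instead of returning it:

```c
#include <stdint.h>

#define PMD_SIZE	(2ULL << 20)
#define PMD_MASK	(~(PMD_SIZE - 1))

#define round_up(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define round_down(x, a)	((x) & ~((a) - 1))

/* Invented for the sketch: what got eagerly accepted vs. recorded. */
struct marked {
	uint64_t accepted_head, accepted_tail;	/* eagerly accepted bytes */
	uint64_t bitmap_start, bitmap_end;	/* recorded as unaccepted */
};

static struct marked mark_unaccepted(uint64_t start, uint64_t end)
{
	struct marked m = {0};

	/*
	 * Accept small regions that might not be representable in the
	 * bitmap, even if they happen to straddle a PMD boundary.
	 */
	if (end - start < 2 * PMD_SIZE) {
		m.accepted_head = end - start;
		return m;
	}

	/* Accept the unaligned head... */
	if (start & ~PMD_MASK) {
		m.accepted_head = round_up(start, PMD_SIZE) - start;
		start = round_up(start, PMD_SIZE);
	}

	/* ...and the unaligned tail... */
	if (end & ~PMD_MASK) {
		m.accepted_tail = end - round_down(end, PMD_SIZE);
		end = round_down(end, PMD_SIZE);
	}

	/* ...then record the PMD-aligned middle as unaccepted. */
	m.bitmap_start = start;
	m.bitmap_end = end;
	return m;
}
```

With the `end - start < 2 * PMD_SIZE` guard up front, at least one full PMD-sized block always survives the rounding, so no `start == end` special case is needed.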

> 
> > diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> > new file mode 100644
> > index 000000000000..cbc24040b853
> > --- /dev/null
> > +++ b/arch/x86/include/asm/unaccepted_memory.h
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright (C) 2020 Intel Corporation */
> > +#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
> > +#define _ASM_X86_UNACCEPTED_MEMORY_H
> > +
> > +#include <linux/types.h>
> > +
> > +struct boot_params;
> > +
> > +void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
> > +
> > +#endif
> > diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> > index b25d3f82c2f3..16bc686a198d 100644
> > --- a/arch/x86/include/uapi/asm/bootparam.h
> > +++ b/arch/x86/include/uapi/asm/bootparam.h
> > @@ -217,7 +217,8 @@ struct boot_params {
> >   	struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */
> >   	__u8  _pad8[48];				/* 0xcd0 */
> >   	struct edd_info eddbuf[EDDMAXNR];		/* 0xd00 */
> > -	__u8  _pad9[276];				/* 0xeec */
> > +	__u64 unaccepted_memory;			/* 0xeec */
> > +	__u8  _pad9[268];				/* 0xef4 */
> >   } __attribute__((packed));
> >   /**
> > diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> > index 2c3dac5ecb36..36c1bf33f112 100644
> > --- a/drivers/firmware/efi/Kconfig
> > +++ b/drivers/firmware/efi/Kconfig
> > @@ -243,6 +243,20 @@ config EFI_DISABLE_PCI_DMA
> >   	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
> >   	  may be used to override this option.
> > +config UNACCEPTED_MEMORY
> > +	bool
> > +	depends on EFI_STUB
> > +	help
> > +	   Some Virtual Machine platforms, such as Intel TDX, introduce
> > +	   the concept of memory acceptance, requiring memory to be accepted
> > +	   before it can be used by the guest. This protects against a class of
> > +	   attacks by the virtual machine platform.
> 
> 	Some Virtual Machine platforms, such as Intel TDX, require
> 	some memory to be "accepted" by the guest before it can be used.
> 	This requirement protects against a class of attacks by the
> 	virtual machine platform.
> 
> Can we make this "class of attacks" a bit more concrete?  Maybe:
> 
> 	This mechanism helps prevent malicious hosts from making changes
> 	to guest memory.
> 
> ??

Okay.

> > +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> > +
> > +	   This option adds support for unaccepted memory and makes such memory
> > +	   usable by kernel.
> > +
> >   endmenu
> >   config EFI_EMBEDDED_FIRMWARE
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index ae79c3300129..abe862c381b6 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -740,6 +740,7 @@ static __initdata char memory_type_name[][13] = {
> >   	"MMIO Port",
> >   	"PAL Code",
> >   	"Persistent",
> > +	"Unaccepted",
> >   };
> >   char * __init efi_md_typeattr_format(char *buf, size_t size,
> > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> > index a0b946182b5e..346b12d6f1b2 100644
> > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > @@ -9,12 +9,14 @@
> >   #include <linux/efi.h>
> >   #include <linux/pci.h>
> >   #include <linux/stddef.h>
> > +#include <linux/bitmap.h>
> >   #include <asm/efi.h>
> >   #include <asm/e820/types.h>
> >   #include <asm/setup.h>
> >   #include <asm/desc.h>
> >   #include <asm/boot.h>
> > +#include <asm/unaccepted_memory.h>
> >   #include "efistub.h"
> > @@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
> >   			e820_type = E820_TYPE_PMEM;
> >   			break;
> > +		case EFI_UNACCEPTED_MEMORY:
> > +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +				continue;
> > +			e820_type = E820_TYPE_RAM;
> > +			mark_unaccepted(params, d->phys_addr,
> > +					d->phys_addr + PAGE_SIZE * d->num_pages);
> > +			break;
> >   		default:
> >   			continue;
> >   		}
> > @@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
> >   {
> >   	efi_status_t status;
> >   	__u32 nr_desc;
> > +	bool unaccepted_memory_present = false;
> > +	u64 max_addr = 0;
> > +	int i;
> >   	status = efi_get_memory_map(map);
> >   	if (status != EFI_SUCCESS)
> > @@ -589,9 +601,55 @@ static efi_status_t allocate_e820(struct boot_params *params,
> >   		if (status != EFI_SUCCESS)
> >   			goto out;
> >   	}
> > +
> > +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +		goto out;
> > +
> > +	/* Check if there's any unaccepted memory and find the max address */
> > +	for (i = 0; i < nr_desc; i++) {
> > +		efi_memory_desc_t *d;
> > +
> > +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> > +		if (d->type == EFI_UNACCEPTED_MEMORY)
> > +			unaccepted_memory_present = true;
> > +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> > +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> > +	}
> > +
> > +	/*
> > +	 * If unaccepted memory present allocate a bitmap to track what memory
> 
> 			       ^ is
> 
> > +	 * has to be accepted before access.
> > +	 *
> > +	 * One bit in the bitmap represents 2MiB in the address space: one 4k
> > +	 * page is enough to track 64GiB or physical address space.
> 
> That's a bit awkward and needs a "or->of".  Perhaps:
> 
> 	* One bit in the bitmap represents 2MiB in the address space:
> 	* A 4k bitmap can track 64GiB of physical address space.

Okay.
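
For the record, the arithmetic behind both figures in that comment checks out (a throwaway helper, not kernel code):

```c
#include <stdint.h>

#define KiB (1ULL << 10)
#define MiB (1ULL << 20)
#define GiB (1ULL << 30)
#define PiB (1ULL << 50)

/* Bytes of bitmap needed to track 'span' bytes at one bit per 2MiB. */
static uint64_t bitmap_bytes(uint64_t span)
{
	return span / (2 * MiB) / 8;
}
```

One 4KiB page of bitmap holds 4096 * 8 = 32768 bits and so covers 32768 * 2MiB = 64GiB, while a 4PiB span costs 256MiB of bitmap -- the worst-case figure quoted below.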

> 
> > +	 * In the worst case scenario -- a huge hole in the middle of the
> > +	 * address space -- It needs 256MiB to handle 4PiB of the address
> > +	 * space.
> > +	 *
> > +	 * TODO: handle situation if params->unaccepted_memory has already set.
> > +	 * It's required to deal with kexec.
> 
> What happens today with kexec() since it's not dealt with?

I didn't give it a try, but I assume it will hang.

There are more things to do to make kexec working and safe. We will get
there, but it is not top priority.

-- 
 Kirill A. Shutemov


* Re: [PATCHv2 3/7] efi/x86: Implement support for unaccepted memory
  2022-01-12 19:29     ` Kirill A. Shutemov
@ 2022-01-12 19:35       ` Dave Hansen
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2022-01-12 19:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On 1/12/22 11:29 AM, Kirill A. Shutemov wrote:
>>> +	 * In the worst case scenario -- a huge hole in the middle of the
>>> +	 * address space -- It needs 256MiB to handle 4PiB of the address
>>> +	 * space.
>>> +	 *
>>> +	 * TODO: handle situation if params->unaccepted_memory has already set.
>>> +	 * It's required to deal with kexec.
>> What happens today with kexec() since its not dealt with?
> I didn't give it a try, but I assume it will hang.
> 
> There are more things to do to make kexec work and be safe. We will get
> there, but it is not a top priority.

Well, if we know it's broken, shouldn't we at least turn kexec off?

It would be dirt simple to do in Kconfig.  As would setting:

	kexec_load_disabled = true;

which would probably also do the trick.  That's from three seconds of
looking.  I'm sure you can come up with something better.
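For illustration, such a guard might look like the Kconfig fragment below. The UNACCEPTED_MEMORY symbol name is an assumption, not something from this series:

```
# Sketch only: make the two features mutually exclusive until kexec is
# handled (the UNACCEPTED_MEMORY symbol name is assumed, not real).
config UNACCEPTED_MEMORY
	bool "Unaccepted memory support"
	depends on !KEXEC
```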


* Re: [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap
  2022-01-11 19:10   ` Dave Hansen
@ 2022-01-12 19:43     ` Kirill A. Shutemov
  2022-01-12 19:53       ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-12 19:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote:
> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> > Unaccepted memory bitmap is allocated during decompression stage and
> > handed over to main kernel image via boot_params. The bitmap is used to
> > track if memory has been accepted.
> > 
> > Reserve unaccepted memory bitmap has to prevent reallocating memory for
> > other means.
> 
> I'm having a hard time parsing that changelog, especially the second
> paragraph.  Could you give it another shot?

What about this:

	Unaccepted memory bitmap is allocated during decompression stage and
	handed over to main kernel image via boot_params.

	Kernel tracks what memory has been accepted in the bitmap.

	Reserve memory where the bitmap is placed to prevent memblock from
	re-allocating the memory for other needs.

?

> > +	/* Mark unaccepted memory bitmap reserved */
> > +	if (boot_params.unaccepted_memory) {
> > +		unsigned long size;
> > +
> > +		/* One bit per 2MB */
> > +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > +				    PMD_SIZE * BITS_PER_BYTE);
> > +		memblock_reserve(boot_params.unaccepted_memory, size);
> > +	}
> 
> Is it OK that the size of the bitmap is inferred from
> e820__end_of_ram_pfn()?  Is this OK in the presence of mem= and other things
> that muck with the e820?

Good question. I think we are fine. If kernel is not able to allocate
memory from a part of physical address space we don't need the bitmap for
it either.

-- 
 Kirill A. Shutemov


* Re: [PATCHv2 6/7] x86/mm: Provide helpers for unaccepted memory
  2022-01-11 20:01   ` Dave Hansen
@ 2022-01-12 19:43     ` Kirill A. Shutemov
  0 siblings, 0 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2022-01-12 19:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On Tue, Jan 11, 2022 at 12:01:56PM -0800, Dave Hansen wrote:
> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> > Core-mm requires few helpers to support unaccepted memory:
> > 
> >   - accept_memory() checks the range of addresses against the bitmap and
> >     accept memory if needed;
> > 
> >   - maybe_set_page_offline() checks the bitmap and marks a page with
> >     PageOffline() if memory acceptance required on the first
> >     allocation of the page.
> > 
> >   - accept_and_clear_page_offline() accepts memory for the page and clears
> >     PageOffline().
> > 
> ...
> > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +	unsigned long flags;
> > +	if (!boot_params.unaccepted_memory)
> > +		return;
> > +
> > +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> > +	__accept_memory(start, end);
> > +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +}
> 
> Not a big deal, but please cc me on all the patches in the series.  This is
> called from the core mm patches which I wasn't cc'd on.
> 
> This also isn't obvious, but this introduces a new, global lock into the
> fast path of the page allocator and holds it for extended periods of time.
> It won't be taken any more once all memory is accepted, but you can sure bet
> that it will be noticeable until that happens.
> 
> *PLEASE* document this.  It needs changelog and probably code comments.

Okay, will do.

-- 
 Kirill A. Shutemov


* Re: [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap
  2022-01-12 19:43     ` Kirill A. Shutemov
@ 2022-01-12 19:53       ` Dave Hansen
  2022-01-15 18:46         ` Mike Rapoport
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2022-01-12 19:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel

On 1/12/22 11:43 AM, Kirill A. Shutemov wrote:
> On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote:
>> On 1/11/22 03:33, Kirill A. Shutemov wrote:
>>> Unaccepted memory bitmap is allocated during decompression stage and
>>> handed over to main kernel image via boot_params. The bitmap is used to
>>> track if memory has been accepted.
>>>
>>> Reserve unaccepted memory bitmap has to prevent reallocating memory for
>>> other means.
>>
>> I'm having a hard time parsing that changelog, especially the second
>> paragraph.  Could you give it another shot?
> 
> What about this:
> 
> 	Unaccepted memory bitmap is allocated during decompression stage and
> 	handed over to main kernel image via boot_params.
> 
> 	Kernel tracks what memory has been accepted in the bitmap.
> 
> 	Reserve memory where the bitmap is placed to prevent memblock from
> 	re-allocating the memory for other needs.
> 
> ?

Ahh, I get what you're trying to say now.  But, it still really lacks a
coherent problem statement.  How about this?

	== Problem ==

	A given page of memory can only be accepted once.  The kernel
	has a need to accept memory both in the early decompression
	stage and during normal runtime.

	== Solution ==

	Use a bitmap to communicate the acceptance state of each page
	between the decompression stage and normal runtime.  This
	eliminates the possibility of attempting to double-accept a
	page.

	== Details ==

	Allocate the bitmap during decompression stage and hand it over
	to the main kernel image via boot_params.

	In the runtime kernel, reserve the bitmap's memory to ensure
	nothing overwrites it.

>>> +	/* Mark unaccepted memory bitmap reserved */
>>> +	if (boot_params.unaccepted_memory) {
>>> +		unsigned long size;
>>> +
>>> +		/* One bit per 2MB */
>>> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
>>> +				    PMD_SIZE * BITS_PER_BYTE);
>>> +		memblock_reserve(boot_params.unaccepted_memory, size);
>>> +	}
>>
>> Is it OK that the size of the bitmap is inferred from
>> e820__end_of_ram_pfn()?  Is this OK in the presence of mem= and other things
>> that muck with the e820?
> 
> Good question. I think we are fine. If kernel is not able to allocate
> memory from a part of physical address space we don't need the bitmap for
> it either.

That's a good point.  If the e820 range does a one-way shrink it's
probably fine.  The only problem would be if the bitmap had space for
stuff past e820__end_of_ram_pfn() *and* it later needed to be accepted.

Would it be worth recording the size of the reservation and then
double-checking against it in the bitmap operations?


* Re: [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-12 18:40       ` Dave Hansen
@ 2022-01-13  7:42         ` Mike Rapoport
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Rapoport @ 2022-01-13  7:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Jan 12, 2022 at 10:40:53AM -0800, Dave Hansen wrote:
> On 1/12/22 10:30, Kirill A. Shutemov wrote:
> > On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote:
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index 1018e50566f3..6dfa594192de 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> > > >    		 */
> > > >    		kmemleak_alloc_phys(found, size, 0, 0);
> > > > +	accept_memory(found, found + size);
> > > >    	return found;
> > > >    }
> > > 
> > > This could use a comment.
> > 
> > How about this:
> > 
> > 	/*
> > 	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > 	 * requiring memory to be accepted before it can be used by the
> > 	 * guest.
> > 	 *
> > 	 * Accept the memory of the allocated buffer.
> > 	 */
> 
> I think a one-liner that might cue the reader to go look at accept_memory()
> itself would be fine.  Maybe:
> 
> 	/* Make the memblock usable when running in picky VM guests: */

I'd s/memblock/found range/ or something like that, memblock is too vague
IMO
 
> That implies that the memory isn't usable without doing this and also points
> out that it's related to running in a guest.

-- 
Sincerely yours,
Mike.


* Re: [PATCHv2 1/7] mm: Add support for unaccepted memory
  2022-01-12 19:15       ` Kirill A. Shutemov
@ 2022-01-14 13:22         ` David Hildenbrand
  0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2022-01-14 13:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 12.01.22 20:15, Kirill A. Shutemov wrote:
> On Wed, Jan 12, 2022 at 12:31:10PM +0100, David Hildenbrand wrote:
>>
>>>
>>> Looking at stuff like this, I can't help but think that a:
>>>
>>> 	#define PageOffline PageUnaccepted
>>>
>>> and some other renaming would be a fine idea.  I get that the Offline 
>>> bit can be reused, but I'm not sure that the "Offline" *naming* should 
>>> be reused.  What you're doing here is logically distinct from existing 
>>> offlining.
>>
>> Yes, or using a new pagetype bit to make the distinction clearer.
>> Especially the function names like maybe_set_page_offline() et. Al are
>> confusing IMHO. They are all about accepting unaccepted memory ... and
>> should express that.
> 
> "Unaccepted" is UEFI terminology and I'm not sure we want to expose
> core-mm to it. Power/S390/ARM may have a different name for the same
> concept. Offline/online is neutral terminology, familiar to MM developers.

Personally, I'd much rather prefer clear UEFI terminology for now than
making the code more confusing to read. We can always generalize later
iff there are similar needs by other archs (and if they are able to come
up with a better name). But maybe we can find a different name immediately.

The issue with online vs. offline I have is that we already have enough
confusion:

offline page: memory section is offline. These pages are not managed by
the buddy. The memmap is stale unless we're dealing with special
ZONE_DEVICE memory.

logically offline pages: memory section is online and pages are
PageOffline(). These pages were removed from the buddy e.g., to free
them up in the hypervisor.

soft offline pages:  memory section is online and pages are
PageHWPoison(). These pages are removed from the buddy such that we
cannot allocate them to not trigger MCEs.


offline pages are exposed to the buddy by onlining them
(generic_online_page()), which is init+freeing. PageOffline() and
PageHWPoison() pages are onlined by removing the flag and freeing them
to the buddy.


Your case is different in that the pages are managed by the buddy and
they don't really have online/offline semantics compared to what we
already have. All the buddy has to do is prepare them for initial use.


I'm fine with reusing PageOffline(), but for the purpose of reading the
code, I think we really want some different terminology in page_alloc.c

So using any such terminology would make it clearer to me:
* PageBuddyUnprepared()
* PageBuddyUninitialized()
* PageBuddyUnprocessed()
* PageBuddyUnready()


> 
> What if I change accept->online in function names and document the meaning
> properly?
> 
>> I assume PageOffline() will be set only on the first sub-page of a
>> high-order PageBuddy() page, correct?
>>
>> Then we'll have to monitor all PageOffline() users such that they can
>> actually deal with PageBuddy() pages spanning *multiple* base pages for
>> a PageBuddy() page. For now it's clear that if a page is PageOffline(),
>> it cannot be PageBuddy() and cannot span more than one base page.
> 
>> E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on
>> individual base pages.
> 
> Right, pages that are offline from the hotplug POV are never on the page
> allocator's free lists, so it cannot ever step on them.
> 


-- 
Thanks,

David / dhildenb



* Re: [PATCHv2 5/7] x86/mm: Reserve unaccepted memory bitmap
  2022-01-12 19:53       ` Dave Hansen
@ 2022-01-15 18:46         ` Mike Rapoport
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Rapoport @ 2022-01-15 18:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Jan 12, 2022 at 11:53:42AM -0800, Dave Hansen wrote:
> On 1/12/22 11:43 AM, Kirill A. Shutemov wrote:
> > On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote:
> >> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> >>
> >>> +	/* Mark unaccepted memory bitmap reserved */
> >>> +	if (boot_params.unaccepted_memory) {
> >>> +		unsigned long size;
> >>> +
> >>> +		/* One bit per 2MB */
> >>> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> >>> +				    PMD_SIZE * BITS_PER_BYTE);
> >>> +		memblock_reserve(boot_params.unaccepted_memory, size);
> >>> +	}
> >>
> >> Is it OK that the size of the bitmap is inferred from
> >> e820__end_of_ram_pfn()?  Is this OK in the presence of mem= and other things
> >> that muck with the e820?
> > 
> > Good question. I think we are fine. If kernel is not able to allocate
> > memory from a part of physical address space we don't need the bitmap for
> > it either.
> 
> That's a good point.  If the e820 range does a one-way shrink it's
> probably fine.  The only problem would be if the bitmap had space for
> stuff past e820__end_of_ram_pfn() *and* it later needed to be accepted.

It's unlikely, but e820 can grow because of EFI and because of memmap=.
To be completely on the safe side, the unaccepted bitmap should be reserved
after parse_early_param() and efi_memblock_x86_reserve_range().

Since we do not have memblock allocations before e820__memblock_setup()
anyway, the simplest thing would be to do the reservation first thing in
e820__memblock_setup().

-- 
Sincerely yours,
Mike.


* Re: [PATCHv2 0/7] Implement support for unaccepted memory
  2022-01-11 11:33 [PATCHv2 0/7] Implement support for unaccepted memory Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2022-01-11 11:33 ` [PATCHv2 7/7] x86/tdx: Unaccepted memory support Kirill A. Shutemov
@ 2022-01-18 21:05 ` Brijesh Singh
  7 siblings, 0 replies; 25+ messages in thread
From: Brijesh Singh @ 2022-01-18 21:05 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: brijesh.singh, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-efi,
	linux-kernel

Hi Kirill,

...

> 
> The approach lowers boot time substantially. Boot to shell is ~2.5x
> faster for 4G TDX VM and ~4x faster for 64G.
> 
> Patches 1-6/7 are generic and don't have any dependencies on TDX. They
> should serve AMD SEV needs as well. TDX-specific code isolated in the
> last patch. This patch requires the core TDX patchset which is currently
> under review.
> 

I can confirm that this series works for the SEV-SNP guest. I was able 
to hook the SEV-SNP page validation vmgexit (similar to the TDX patch#7) 
and have verified that the guest kernel successfully accepted all the 
memory regions marked unaccepted by the EFI boot loader.
Not a big deal, but can I ask you to include me in Cc on the future 
series; I should be able to do more testing on SNP hardware and provide 
my Tested-by tag.

~ Brijesh


