* [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. That
leads to table fragmentation, and there's only a limited number of
entries in the e820 table.

Another option is to mark such memory as usable in e820 and track
whether the range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it takes 256MiB to handle 4PiB of the address
space.
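
Spelling out the arithmetic behind these numbers:

  one 4k page = 4096 * 8 = 32768 bits  ->  32768 * 2MiB = 64GiB tracked
  4PiB / 2MiB = 2^52 / 2^21 = 2^31 bits -> 2^31 / 8 bytes = 256MiB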

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for a 4G TDX VM and ~4x faster for a 64G one.

TDX-specific code is isolated from the core of unaccepted memory
support. This should make it easier to plug in different implementations
of unaccepted memory, such as SEV-SNP.

The tree can be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git unaccepted-memory

v8:
 - Rewrite core-mm support for unaccepted memory (patch 02/14);
 - s/UnacceptedPages/Unaccepted/ in meminfo;
 - Drop arch/x86/boot/compressed/compiler.h;
 - Fix build errors;
 - Adjust commit messages and comments;
 - Reviewed-bys from Dave and Borislav;
 - Rebased to tip/master.
v7:
 - Rework meminfo counter to use PageUnaccepted() and move to generic code;
 - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
 - Add Reviewed-by from David;
v6:
 - Fix load_unaligned_zeropad() on machine with unaccepted memory;
 - Clear PageUnaccepted() on merged pages, leaving it only on head;
 - Clarify error handling in allocate_e820();
 - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
 - Disable kexec at boottime instead of build conflict;
 - Rebased to tip/master;
 - Spelling fixes;
 - Add Reviewed-by from Mike and David;
v5:
 - Update comments and commit messages;
   + Explain options for unaccepted memory handling;
 - Expose amount of unaccepted memory in /proc/meminfo;
 - Adjust check in page_expected_state();
 - Fix error code handling in allocate_e820();
 - Centralize __pa()/__va() definitions in the boot stub;
 - Avoid includes from the main kernel in the boot stub;
 - Use an existing hole in boot_param for unaccepted_memory, instead of adding
   to the end of the structure;
 - Extract allocate_unaccepted_memory() from allocate_e820();
 - Complain if there's unaccepted memory, but the kernel does not support it;
 - Fix vmstat counter;
 - Split up few preparatory patches;
 - Random readability adjustments;
v4:
 - PageBuddyUnaccepted() -> PageUnaccepted();
 - Use separate page_type, not shared with offline;
 - Rework interface between core-mm and arch code;
 - Adjust commit messages;
 - Ack from Mike;
Kirill A. Shutemov (14):
  x86/boot: Centralize __pa()/__va() definitions
  mm: Add support for unaccepted memory
  mm: Report unaccepted memory in meminfo
  efi/x86: Get full memory map in allocate_e820()
  x86/boot: Add infrastructure required for unaccepted memory support
  efi/x86: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  x86/mm: Reserve unaccepted memory bitmap
  x86/mm: Provide helpers for unaccepted memory
  x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  x86: Disable kexec if system has unaccepted memory
  x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
    boot stub
  x86/tdx: Refactor try_accept_one()
  x86/tdx: Add unaccepted memory support

 Documentation/x86/zero-page.rst          |   1 +
 arch/x86/Kconfig                         |   2 +
 arch/x86/boot/bitops.h                   |  40 ++++++++
 arch/x86/boot/compressed/Makefile        |   3 +-
 arch/x86/boot/compressed/align.h         |  14 +++
 arch/x86/boot/compressed/bitmap.c        |  43 ++++++++
 arch/x86/boot/compressed/bitmap.h        |  49 +++++++++
 arch/x86/boot/compressed/bits.h          |  36 +++++++
 arch/x86/boot/compressed/efi.h           |   1 +
 arch/x86/boot/compressed/error.c         |  19 ++++
 arch/x86/boot/compressed/error.h         |   1 +
 arch/x86/boot/compressed/find.c          |  54 ++++++++++
 arch/x86/boot/compressed/find.h          |  79 +++++++++++++++
 arch/x86/boot/compressed/ident_map_64.c  |   8 --
 arch/x86/boot/compressed/kaslr.c         |  35 ++++---
 arch/x86/boot/compressed/math.h          |  37 +++++++
 arch/x86/boot/compressed/mem.c           | 122 ++++++++++++++++++++++
 arch/x86/boot/compressed/minmax.h        |  61 +++++++++++
 arch/x86/boot/compressed/misc.c          |   6 ++
 arch/x86/boot/compressed/misc.h          |  15 +++
 arch/x86/boot/compressed/pgtable_types.h |  25 +++++
 arch/x86/boot/compressed/sev.c           |   2 -
 arch/x86/boot/compressed/tdx-shared.c    |   2 +
 arch/x86/boot/compressed/tdx.c           |  39 +++++++
 arch/x86/coco/tdx/Makefile               |   2 +-
 arch/x86/coco/tdx/tdx-shared.c           |  95 +++++++++++++++++
 arch/x86/coco/tdx/tdx.c                  | 113 +--------------------
 arch/x86/include/asm/kexec.h             |   5 +
 arch/x86/include/asm/page.h              |   3 +
 arch/x86/include/asm/shared/tdx.h        |  48 +++++++++
 arch/x86/include/asm/tdx.h               |  21 +---
 arch/x86/include/asm/unaccepted_memory.h |  16 +++
 arch/x86/include/uapi/asm/bootparam.h    |   2 +-
 arch/x86/kernel/e820.c                   |  17 ++++
 arch/x86/mm/Makefile                     |   2 +
 arch/x86/mm/unaccepted_memory.c          | 123 +++++++++++++++++++++++
 drivers/base/node.c                      |   7 ++
 drivers/firmware/efi/Kconfig             |  14 +++
 drivers/firmware/efi/efi.c               |   1 +
 drivers/firmware/efi/libstub/x86-stub.c  | 101 ++++++++++++++++---
 fs/proc/meminfo.c                        |   5 +
 include/linux/efi.h                      |   3 +-
 include/linux/kexec.h                    |   7 ++
 include/linux/mmzone.h                   |   8 ++
 include/linux/page-flags.h               |  24 +++++
 kernel/kexec.c                           |   4 +
 kernel/kexec_file.c                      |   4 +
 mm/internal.h                            |  12 +++
 mm/memblock.c                            |   9 ++
 mm/page_alloc.c                          | 121 ++++++++++++++++++++++
 mm/vmstat.c                              |   1 +
 51 files changed, 1291 insertions(+), 171 deletions(-)
 create mode 100644 arch/x86/boot/compressed/align.h
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/bitmap.h
 create mode 100644 arch/x86/boot/compressed/bits.h
 create mode 100644 arch/x86/boot/compressed/find.c
 create mode 100644 arch/x86/boot/compressed/find.h
 create mode 100644 arch/x86/boot/compressed/math.h
 create mode 100644 arch/x86/boot/compressed/mem.c
 create mode 100644 arch/x86/boot/compressed/minmax.h
 create mode 100644 arch/x86/boot/compressed/pgtable_types.h
 create mode 100644 arch/x86/boot/compressed/tdx-shared.c
 create mode 100644 arch/x86/coco/tdx/tdx-shared.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 arch/x86/mm/unaccepted_memory.c

-- 
2.38.0



* [PATCHv8 01/14] x86/boot: Centralize __pa()/__va() definitions
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Mike Rapoport, Dave Hansen

Replace multiple __pa()/__va() definitions with a single one in misc.h.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/boot/compressed/ident_map_64.c | 8 --------
 arch/x86/boot/compressed/misc.h         | 9 +++++++++
 arch/x86/boot/compressed/sev.c          | 2 --
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index d4a314cc50d6..56550f289118 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -8,14 +8,6 @@
  * Copyright (C)      2016  Kees Cook
  */
 
-/*
- * Since we're dealing with identity mappings, physical and virtual
- * addresses are the same, so override these defines which are ultimately
- * used by the headers in misc.h.
- */
-#define __pa(x)  ((unsigned long)(x))
-#define __va(x)  ((void *)((unsigned long)(x)))
-
 /* No PAGE_TABLE_ISOLATION support needed either: */
 #undef CONFIG_PAGE_TABLE_ISOLATION
 
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 62208ec04ca4..d37d612a6390 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -19,6 +19,15 @@
 /* cpu_feature_enabled() cannot be used this early */
 #define USE_EARLY_PGTABLE_L5
 
+/*
+ * Boot stub deals with identity mappings, physical and virtual addresses are
+ * the same, so override these defines.
+ *
+ * <asm/page.h> will not define them if they are already defined.
+ */
+#define __pa(x)  ((unsigned long)(x))
+#define __va(x)  ((void *)((unsigned long)(x)))
+
 #include <linux/linkage.h>
 #include <linux/screen_info.h>
 #include <linux/elf.h>
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index c93930d5ccbd..e4d3a2da2eb9 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -104,9 +104,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 }
 
 #undef __init
-#undef __pa
 #define __init
-#define __pa(x)	((unsigned long)(x))
 
 #define __BOOT_COMPRESSED
 
-- 
2.38.0



* [PATCHv8 02/14] mm: Add support for unaccepted memory
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Mike Rapoport

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during the boot. It is easy to implement and
    it doesn't have runtime cost once the system is booted. The downside
    is very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory acceptance means latency spikes every time the
    kernel steps onto a new memory block. The spikes will go away once
    the workload's data set size stabilizes or all memory gets accepted.

 3. Accept all memory in the background. Introduce a thread (or several)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3 and they can be implemented later based on user
demand.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

The memblock change is trivial.

The page allocator is modified to accept pages. New memory gets accepted
before pages are put on free lists. It is done lazily: new pages are only
accepted when we run out of already accepted memory.

An architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.
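
For illustration, a minimal sketch of how an architecture could back
these helpers with the 2MiB-granularity bitmap described in the cover
letter. This is simplified (no locking) and arch_accept_range() is a
made-up stand-in for the platform-specific protocol; the real x86
implementation comes later in the series:

	/* 1 bit == 2MiB; a set bit means the range is still unaccepted */
	static unsigned long *unaccepted_bitmap;

	void accept_memory(phys_addr_t start, phys_addr_t end)
	{
		unsigned long first = start / PMD_SIZE;
		unsigned long last = DIV_ROUND_UP(end, PMD_SIZE);
		unsigned long bit;

		for (bit = first; bit < last; bit++) {
			if (!test_bit(bit, unaccepted_bitmap))
				continue;
			/* Hypothetical hook for the TDX/SEV-SNP protocol */
			arch_accept_range(bit * PMD_SIZE, PMD_SIZE);
			clear_bit(bit, unaccepted_bitmap);
		}
	}

	bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
	{
		unsigned long first = start / PMD_SIZE;
		unsigned long last = DIV_ROUND_UP(end, PMD_SIZE);

		return find_next_bit(unaccepted_bitmap, last, first) < last;
	}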

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
---
 include/linux/mmzone.h     |   5 ++
 include/linux/page-flags.h |  24 ++++++++
 mm/internal.h              |  12 ++++
 mm/memblock.c              |   9 +++
 mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
 5 files changed, 169 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f74891556f3..da335381e63f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -822,6 +822,11 @@ struct zone {
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER];
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* pages to be accepted */
+	struct list_head	unaccepted_pages;
+#endif
+
 	/* zone flags, see below */
 	unsigned long		flags;
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0b0ae5084e60..ce953be8fe10 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -941,6 +941,7 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unaccepted	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -966,6 +967,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_FALSE(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)		\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -996,6 +1009,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+/*
+ * PageUnaccepted() indicates that the page has to be "accepted" before it can
+ * be read or written. The page allocator must call accept_page() before
+ * touching the page or returning it to the caller.
+ */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+PAGE_TYPE_OPS(Unaccepted, unaccepted)
+#else
+PAGE_TYPE_OPS_FALSE(Unaccepted)
+#endif
+
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
 extern void page_offline_begin(void);
diff --git a/mm/internal.h b/mm/internal.h
index 6b7ef495b56d..8ef4f88608ad 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -856,4 +856,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
 	return !(vma->vm_flags & VM_SOFTDIRTY);
 }
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+						    phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 511d4783dcf1..3bc404a5352a 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1423,6 +1423,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e60657875d3..6d597e833a73 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
 
 int page_group_by_mobility_disabled __read_mostly;
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+/* Counts number of zones with unaccepted pages. */
+static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
+#endif
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 /*
  * During boot we initialize deferred pages on-demand, as needed, but once
@@ -1043,12 +1048,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
+	VM_BUG_ON_PAGE(PageUnaccepted(page), page);
 	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 					   unsigned int order)
 {
+	VM_BUG_ON_PAGE(PageUnaccepted(page), page);
+
 	/* clear reported state and update reported page count */
 	if (page_reported(page))
 		__ClearPageReported(page);
@@ -1728,6 +1736,97 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	__count_vm_events(PGFREE, 1 << order);
 }
 
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	phys_addr_t end = start + (PAGE_SIZE << order);
+
+	return range_contains_unaccepted_memory(start, end);
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+}
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+static bool try_to_accept_memory(struct zone *zone)
+{
+	unsigned long flags, order;
+	struct page *page;
+	bool last = false;
+	int migratetype;
+
+	if (!static_branch_unlikely(&unaccepted_pages))
+		return false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	page = list_first_entry_or_null(&zone->unaccepted_pages,
+					struct page, lru);
+	if (!page) {
+		spin_unlock_irqrestore(&zone->lock, flags);
+		return false;
+	}
+
+	list_del(&page->lru);
+	last = list_empty(&zone->unaccepted_pages);
+
+	order = page->private;
+	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
+
+	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
+	__mod_zone_freepage_state(zone, -1 << order, migratetype);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	if (last)
+		static_branch_dec(&unaccepted_pages);
+
+	accept_page(page, order);
+	__ClearPageUnaccepted(page);
+	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+
+	return true;
+}
+
+static void __free_unaccepted(struct page *page, unsigned int order)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	int migratetype;
+	bool first = false;
+
+	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
+	__SetPageUnaccepted(page);
+	page->private = order;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	first = list_empty(&zone->unaccepted_pages);
+	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
+	list_add_tail(&page->lru, &zone->unaccepted_pages);
+	__mod_zone_freepage_state(zone, 1 << order, migratetype);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	if (first)
+		static_branch_inc(&unaccepted_pages);
+}
+
+#else
+
+static bool try_to_accept_memory(struct zone *zone)
+{
+	return false;
+}
+
+static void __free_unaccepted(struct page *page, unsigned int order)
+{
+	BUILD_BUG();
+}
+
+#endif /* CONFIG_UNACCEPTED_MEMORY */
+
 void __free_pages_core(struct page *page, unsigned int order)
 {
 	unsigned int nr_pages = 1 << order;
@@ -1750,6 +1849,13 @@ void __free_pages_core(struct page *page, unsigned int order)
 
 	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
 
+	if (page_contains_unaccepted(page, order)) {
+		if (order >= pageblock_order)
+			return __free_unaccepted(page, order);
+		else
+			accept_page(page, order);
+	}
+
 	/*
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
@@ -1910,6 +2016,9 @@ static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	/* Accept chunks smaller than page-block upfront */
+	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if (pageblock_aligned(pfn))
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -4247,6 +4356,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				       gfp_mask)) {
 			int ret;
 
+			if (try_to_accept_memory(zone))
+				goto try_this_zone;
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/*
 			 * Watermark failed for this zone, but see if we can
@@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 			return page;
 		} else {
+			if (try_to_accept_memory(zone))
+				goto try_this_zone;
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/* Try again if zone has deferred pages */
 			if (static_branch_unlikely(&deferred_pages)) {
@@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
 	}
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	INIT_LIST_HEAD(&zone->unaccepted_pages);
+#endif
 }
 
 /*
-- 
2.38.0



* [PATCHv8 03/14] mm: Report unaccepted memory in meminfo
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Track the amount of unaccepted memory and report it in /proc/meminfo
and in node meminfo.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/base/node.c    | 7 +++++++
 fs/proc/meminfo.c      | 5 +++++
 include/linux/mmzone.h | 3 +++
 mm/page_alloc.c        | 2 ++
 mm/vmstat.c            | 1 +
 5 files changed, 18 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index faf3597a96da..ca6f0590be21 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -448,6 +448,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d ShmemPmdMapped: %8lu kB\n"
 			     "Node %d FileHugePages: %8lu kB\n"
 			     "Node %d FilePmdMapped: %8lu kB\n"
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     "Node %d Unaccepted:     %8lu kB\n"
 #endif
 			     ,
 			     nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -477,6 +480,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
 			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     ,
+			     nid, K(node_page_state(pgdat, NR_UNACCEPTED))
 #endif
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 440960110a42..789b77c7b6df 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -155,6 +155,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	show_val_kb(m, "Unaccepted:     ",
+		    global_node_page_state(NR_UNACCEPTED));
+#endif
+
 	hugetlb_report_meminfo(m);
 
 	arch_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index da335381e63f..9c762e8175fc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -198,6 +198,9 @@ enum node_stat_item {
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
 	NR_KERNEL_STACK_KB,	/* measured in KiB */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	NR_UNACCEPTED,
+#endif
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d597e833a73..e80e8d398863 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1779,6 +1779,7 @@ static bool try_to_accept_memory(struct zone *zone)
 
 	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
 	__mod_zone_freepage_state(zone, -1 << order, migratetype);
+	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, -1 << order);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	if (last)
@@ -1807,6 +1808,7 @@ static void __free_unaccepted(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
 	list_add_tail(&page->lru, &zone->unaccepted_pages);
 	__mod_zone_freepage_state(zone, 1 << order, migratetype);
+	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, 1 << order);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	if (first)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2371d745e00..fb15213be374 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1230,6 +1230,7 @@ const char * const vmstat_text[] = {
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
 	"nr_kernel_stack",
+	"nr_unaccepted",
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	"nr_shadow_call_stack",
 #endif
-- 
2.38.0



* [PATCHv8 04/14] efi/x86: Get full memory map in allocate_e820()
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Borislav Petkov

Currently allocate_e820() is only interested in the size of the map and
the size of each memory descriptor to determine how many e820 entries
the kernel needs.

UEFI Specification version 2.9 introduces a new memory type --
unaccepted memory. To track unaccepted memory the kernel needs to
allocate a bitmap. The size of the bitmap depends on the maximum
physical address present in the system. A full memory map is required
to find the maximum address.

Modify allocate_e820() to get a full memory map.
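
The sizing itself is then straightforward; roughly (a sketch only,
assuming the EFI stub's struct efi_boot_memmap with its trailing map[]
array -- the actual bitmap allocation is introduced in a later patch of
this series):

	/* Find the highest address and size the 1-bit-per-2MiB bitmap */
	u64 max_addr = 0, bitmap_size;
	u32 i;

	for (i = 0; i < map->map_size / map->desc_size; i++) {
		efi_memory_desc_t *d =
			(void *)map->map + (i * map->desc_size);

		max_addr = max(max_addr,
			       d->phys_addr + d->num_pages * EFI_PAGE_SIZE);
	}
	bitmap_size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);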

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
 drivers/firmware/efi/libstub/x86-stub.c | 26 +++++++++++--------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba..fff81843169c 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -681,28 +681,24 @@ static efi_status_t allocate_e820(struct boot_params *params,
 				  struct setup_data **e820ext,
 				  u32 *e820ext_size)
 {
-	unsigned long map_size, desc_size, map_key;
+	struct efi_boot_memmap *map;
 	efi_status_t status;
-	__u32 nr_desc, desc_version;
+	__u32 nr_desc;
 
-	/* Only need the size of the mem map and size of each mem descriptor */
-	map_size = 0;
-	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
-			     &desc_size, &desc_version);
-	if (status != EFI_BUFFER_TOO_SMALL)
-		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
-
-	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
+	status = efi_get_memory_map(&map, false);
+	if (status != EFI_SUCCESS)
+		return status;
 
-	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
-		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+	nr_desc = map->map_size / map->desc_size;
+	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+			EFI_MMAP_NR_SLACK_SLOTS;
 
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
-		if (status != EFI_SUCCESS)
-			return status;
 	}
 
-	return EFI_SUCCESS;
+	efi_bs_call(free_pool, map);
+	return status;
 }
 
 struct exit_boot_struct {
-- 
2.38.0



* [PATCHv8 05/14] x86/boot: Add infrastructure required for unaccepted memory support
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Pull functionality from the main kernel headers and lib/ that is
required for unaccepted memory support.

This is a preparatory patch. The users of this functionality will come
in the following patches.
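
As a taste of how these helpers get used later in the series: the
accept path boils down to walking maximal runs of set bits. An
illustrative sketch (accept_marked_ranges() is hypothetical,
__accept_memory() is the platform hook from a later patch, and PMD_SIZE
is the 2MiB bitmap granularity):

	void accept_marked_ranges(unsigned long *bitmap, unsigned long nbits)
	{
		unsigned long range_start = 0, range_end;

		/* Walk every [range_start, range_end) run of set bits */
		for_each_set_bitrange_from(range_start, range_end, bitmap, nbits) {
			__accept_memory(range_start * PMD_SIZE,
					range_end * PMD_SIZE);
			bitmap_clear(bitmap, range_start,
				     range_end - range_start);
		}
	}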

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/bitops.h                   | 40 ++++++++++++
 arch/x86/boot/compressed/align.h         | 14 +++++
 arch/x86/boot/compressed/bitmap.c        | 43 +++++++++++++
 arch/x86/boot/compressed/bitmap.h        | 49 +++++++++++++++
 arch/x86/boot/compressed/bits.h          | 36 +++++++++++
 arch/x86/boot/compressed/find.c          | 54 ++++++++++++++++
 arch/x86/boot/compressed/find.h          | 79 ++++++++++++++++++++++++
 arch/x86/boot/compressed/math.h          | 37 +++++++++++
 arch/x86/boot/compressed/minmax.h        | 61 ++++++++++++++++++
 arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++
 10 files changed, 438 insertions(+)
 create mode 100644 arch/x86/boot/compressed/align.h
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/bitmap.h
 create mode 100644 arch/x86/boot/compressed/bits.h
 create mode 100644 arch/x86/boot/compressed/find.c
 create mode 100644 arch/x86/boot/compressed/find.h
 create mode 100644 arch/x86/boot/compressed/math.h
 create mode 100644 arch/x86/boot/compressed/minmax.h
 create mode 100644 arch/x86/boot/compressed/pgtable_types.h

diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h
index 8518ae214c9b..38badf028543 100644
--- a/arch/x86/boot/bitops.h
+++ b/arch/x86/boot/bitops.h
@@ -41,4 +41,44 @@ static inline void set_bit(int nr, void *addr)
 	asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr));
 }
 
+static __always_inline void __set_bit(long nr, volatile unsigned long *addr)
+{
+	asm volatile(__ASM_SIZE(bts) " %1,%0" : : "m" (*(volatile long *) addr),
+		     "Ir" (nr) : "memory");
+}
+
+static __always_inline void __clear_bit(long nr, volatile unsigned long *addr)
+{
+	asm volatile(__ASM_SIZE(btr) " %1,%0" : : "m" (*(volatile long *) addr),
+		     "Ir" (nr) : "memory");
+}
+
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static __always_inline unsigned long __ffs(unsigned long word)
+{
+	asm("rep; bsf %1,%0"
+		: "=r" (word)
+		: "rm" (word));
+	return word;
+}
+
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+static __always_inline unsigned long ffz(unsigned long word)
+{
+	asm("rep; bsf %1,%0"
+		: "=r" (word)
+		: "r" (~word));
+	return word;
+}
+
 #endif /* BOOT_BITOPS_H */
diff --git a/arch/x86/boot/compressed/align.h b/arch/x86/boot/compressed/align.h
new file mode 100644
index 000000000000..7ccabbc5d1b8
--- /dev/null
+++ b/arch/x86/boot/compressed/align.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_ALIGN_H
+#define BOOT_ALIGN_H
+#define _LINUX_ALIGN_H /* Inhibit inclusion of <linux/align.h> */
+
+/* @a is a power of 2 value */
+#define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
+#define ALIGN_DOWN(x, a)	__ALIGN_KERNEL((x) - ((a) - 1), (a))
+#define __ALIGN_MASK(x, mask)	__ALIGN_KERNEL_MASK((x), (mask))
+#define PTR_ALIGN(p, a)		((typeof(p))ALIGN((unsigned long)(p), (a)))
+#define PTR_ALIGN_DOWN(p, a)	((typeof(p))ALIGN_DOWN((unsigned long)(p), (a)))
+#define IS_ALIGNED(x, a)		(((x) & ((typeof(x))(a) - 1)) == 0)
+
+#endif
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
new file mode 100644
index 000000000000..789ecadeb521
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "bitmap.h"
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_set >= 0) {
+		*p |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_LONG;
+		mask_to_set = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+		*p |= mask_to_set;
+	}
+}
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
diff --git a/arch/x86/boot/compressed/bitmap.h b/arch/x86/boot/compressed/bitmap.h
new file mode 100644
index 000000000000..35357f5feda2
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITMAP_H
+#define BOOT_BITMAP_H
+#define __LINUX_BITMAP_H /* Inhibit inclusion of <linux/bitmap.h> */
+
+#include "../bitops.h"
+#include "../string.h"
+#include "align.h"
+
+#define BITMAP_MEM_ALIGNMENT 8
+#define BITMAP_MEM_MASK (BITMAP_MEM_ALIGNMENT - 1)
+
+#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
+#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
+
+#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len);
+void __bitmap_clear(unsigned long *map, unsigned int start, int len);
+
+static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
+		unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__set_bit(start, map);
+	else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+		 __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+		memset((char *)map + start / 8, 0xff, nbits / 8);
+	else
+		__bitmap_set(map, start, nbits);
+}
+
+static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
+		unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+		 __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+		 IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
+#endif
diff --git a/arch/x86/boot/compressed/bits.h b/arch/x86/boot/compressed/bits.h
new file mode 100644
index 000000000000..b0ffa007ee19
--- /dev/null
+++ b/arch/x86/boot/compressed/bits.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITS_H
+#define BOOT_BITS_H
+#define __LINUX_BITS_H /* Inhibit inclusion of <linux/bits.h> */
+
+#ifdef __ASSEMBLY__
+#define _AC(X,Y)	X
+#define _AT(T,X)	X
+#else
+#define __AC(X,Y)	(X##Y)
+#define _AC(X,Y)	__AC(X,Y)
+#define _AT(T,X)	((T)(X))
+#endif
+
+#define _UL(x)		(_AC(x, UL))
+#define _ULL(x)		(_AC(x, ULL))
+#define UL(x)		(_UL(x))
+#define ULL(x)		(_ULL(x))
+
+#define BIT(nr)			(UL(1) << (nr))
+#define BIT_ULL(nr)		(ULL(1) << (nr))
+#define BIT_MASK(nr)		(UL(1) << ((nr) % BITS_PER_LONG))
+#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
+#define BIT_ULL_MASK(nr)	(ULL(1) << ((nr) % BITS_PER_LONG_LONG))
+#define BIT_ULL_WORD(nr)	((nr) / BITS_PER_LONG_LONG)
+#define BITS_PER_BYTE		8
+
+#define GENMASK(h, l) \
+	(((~UL(0)) - (UL(1) << (l)) + 1) & \
+	 (~UL(0) >> (BITS_PER_LONG - 1 - (h))))
+
+#define GENMASK_ULL(h, l) \
+	(((~ULL(0)) - (ULL(1) << (l)) + 1) & \
+	 (~ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
+
+#endif
diff --git a/arch/x86/boot/compressed/find.c b/arch/x86/boot/compressed/find.c
new file mode 100644
index 000000000000..b97a9e7c8085
--- /dev/null
+++ b/arch/x86/boot/compressed/find.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "bitmap.h"
+#include "find.h"
+#include "math.h"
+#include "minmax.h"
+
+static __always_inline unsigned long swab(const unsigned long y)
+{
+#if __BITS_PER_LONG == 64
+	return __builtin_bswap64(y);
+#else /* __BITS_PER_LONG == 32 */
+	return __builtin_bswap32(y);
+#endif
+}
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert, unsigned long le)
+{
+	unsigned long tmp, mask;
+
+	if (start >= nbits)
+		return nbits;
+
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
+
+	/* Handle 1st word. */
+	mask = BITMAP_FIRST_WORD_MASK(start);
+	if (le)
+		mask = swab(mask);
+
+	tmp &= mask;
+
+	start = round_down(start, BITS_PER_LONG);
+
+	while (!tmp) {
+		start += BITS_PER_LONG;
+		if (start >= nbits)
+			return nbits;
+
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
+	}
+
+	if (le)
+		tmp = swab(tmp);
+
+	return min(start + __ffs(tmp), nbits);
+}
diff --git a/arch/x86/boot/compressed/find.h b/arch/x86/boot/compressed/find.h
new file mode 100644
index 000000000000..903574b9d57a
--- /dev/null
+++ b/arch/x86/boot/compressed/find.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_FIND_H
+#define BOOT_FIND_H
+#define __LINUX_FIND_H /* Inhibit inclusion of <linux/find.h> */
+
+#include "../bitops.h"
+#include "align.h"
+#include "bits.h"
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert, unsigned long le);
+
+/**
+ * find_next_bit - find the next set bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
+			    unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (offset >= size)
+			return size;
+
+		val = *addr & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+}
+
+/**
+ * find_next_zero_bit - find the next cleared bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number of the next zero bit
+ * If no bits are zero, returns @size.
+ */
+static inline
+unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
+				 unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (offset >= size)
+			return size;
+
+		val = *addr | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+}
+
+/**
+ * for_each_set_bitrange_from - iterate over all set bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first set bit); must be initialized
+ * @e: bit offset of end of current bitrange (first unset bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_set_bitrange_from(b, e, addr, size)		\
+	for ((b) = find_next_bit((addr), (size), (b)),		\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1);	\
+	     (b) < (size);					\
+	     (b) = find_next_bit((addr), (size), (e) + 1),	\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1))
+#endif
diff --git a/arch/x86/boot/compressed/math.h b/arch/x86/boot/compressed/math.h
new file mode 100644
index 000000000000..f7eede84bbc2
--- /dev/null
+++ b/arch/x86/boot/compressed/math.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MATH_H
+#define BOOT_MATH_H
+#define __LINUX_MATH_H /* Inhibit inclusion of <linux/math.h> */
+
+/*
+ *
+ * This looks more complex than it should be. But we need to
+ * get the type for the ~ right in round_down (it needs to be
+ * as wide as the result!), and we want to evaluate the macro
+ * arguments just once each.
+ */
+#define __round_mask(x, y) ((__typeof__(x))((y)-1))
+
+/**
+ * round_up - round up to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round up to (must be a power of 2)
+ *
+ * Rounds @x up to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding up, use roundup() below.
+ */
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
+
+/**
+ * round_down - round down to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round down to (must be a power of 2)
+ *
+ * Rounds @x down to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding down, use rounddown() below.
+ */
+#define round_down(x, y) ((x) & ~__round_mask(x, y))
+
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+
+#endif
diff --git a/arch/x86/boot/compressed/minmax.h b/arch/x86/boot/compressed/minmax.h
new file mode 100644
index 000000000000..4efd05673260
--- /dev/null
+++ b/arch/x86/boot/compressed/minmax.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MINMAX_H
+#define BOOT_MINMAX_H
+#define __LINUX_MINMAX_H /* Inhibit inclusion of <linux/minmax.h> */
+
+/*
+ * This returns a constant expression while determining if an argument is
+ * a constant expression, most importantly without evaluating the argument.
+ * Glory to Martin Uecker <Martin.Uecker@med.uni-goettingen.de>
+ */
+#define __is_constexpr(x) \
+	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
+
+/*
+ * min()/max()/clamp() macros must accomplish three things:
+ *
+ * - avoid multiple evaluations of the arguments (so side-effects like
+ *   "x++" happen only once) when non-constant.
+ * - perform strict type-checking (to generate warnings instead of
+ *   nasty runtime surprises). See the "unnecessary" pointer comparison
+ *   in __typecheck().
+ * - retain result as a constant expressions when called with only
+ *   constant expressions (to avoid tripping VLA warnings in stack
+ *   allocation usage).
+ */
+#define __typecheck(x, y) \
+	(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
+
+#define __no_side_effects(x, y) \
+		(__is_constexpr(x) && __is_constexpr(y))
+
+#define __safe_cmp(x, y) \
+		(__typecheck(x, y) && __no_side_effects(x, y))
+
+#define __cmp(x, y, op)	((x) op (y) ? (x) : (y))
+
+#define __cmp_once(x, y, unique_x, unique_y, op) ({	\
+		typeof(x) unique_x = (x);		\
+		typeof(y) unique_y = (y);		\
+		__cmp(unique_x, unique_y, op); })
+
+#define __careful_cmp(x, y, op) \
+	__builtin_choose_expr(__safe_cmp(x, y), \
+		__cmp(x, y, op), \
+		__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
+
+/**
+ * min - return minimum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define min(x, y)	__careful_cmp(x, y, <)
+
+/**
+ * max - return maximum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define max(x, y)	__careful_cmp(x, y, >)
+
+#endif
diff --git a/arch/x86/boot/compressed/pgtable_types.h b/arch/x86/boot/compressed/pgtable_types.h
new file mode 100644
index 000000000000..8f1d87a69efc
--- /dev/null
+++ b/arch/x86/boot/compressed/pgtable_types.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define _ASM_X86_PGTABLE_DEFS_H /* Inhibit inclusion of <asm/pgtable_types.h> */
+
+#define PAGE_SHIFT	12
+
+#ifdef CONFIG_X86_64
+#define PTE_SHIFT	9
+#elif defined CONFIG_X86_PAE
+#define PTE_SHIFT	9
+#else /* 2-level */
+#define PTE_SHIFT	10
+#endif
+
+enum pg_level {
+	PG_LEVEL_NONE,
+	PG_LEVEL_4K,
+	PG_LEVEL_2M,
+	PG_LEVEL_1G,
+	PG_LEVEL_512G,
+	PG_LEVEL_NUM
+};
+
+#endif
-- 
2.38.0



* [PATCHv8 06/14] efi/x86: Implement support for unaccepted memory
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. That
leads to table fragmentation, and there's only a limited number of
entries in the e820 table.

Another option is to mark such memory as usable in e820 and track
whether the range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it takes 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via boot_params. allocate_e820() allocates the bitmap if
unaccepted memory is present, according to the maximum address in the
memory map.

The same boot_params.unaccepted_memory can be used to pass the bitmap
between two kernels on kexec, but the use-case is not yet implemented.
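
On the kernel side, the bitmap is then reachable through the physical
address stored in boot_params; conceptually (a one-line sketch of the
consumer added later in the series):

	unsigned long *bitmap = __va(boot_params.unaccepted_memory);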

The implementation requires some basic helpers in the boot stub. They
are provided by linux/ includes in the main kernel image, but are not
present in the boot stub. Create a copy of the required functionality
in the boot stub.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/zero-page.rst          |  1 +
 arch/x86/boot/compressed/Makefile        |  1 +
 arch/x86/boot/compressed/mem.c           | 73 ++++++++++++++++++++++++
 arch/x86/include/asm/unaccepted_memory.h | 10 ++++
 arch/x86/include/uapi/asm/bootparam.h    |  2 +-
 drivers/firmware/efi/Kconfig             | 14 +++++
 drivers/firmware/efi/efi.c               |  1 +
 drivers/firmware/efi/libstub/x86-stub.c  | 68 ++++++++++++++++++++++
 include/linux/efi.h                      |  3 +-
 9 files changed, 171 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/boot/compressed/mem.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
index 45aa9cceb4f1..f21905e61ade 100644
--- a/Documentation/x86/zero-page.rst
+++ b/Documentation/x86/zero-page.rst
@@ -20,6 +20,7 @@ Offset/Size	Proto	Name			Meaning
 060/010		ALL	ist_info		Intel SpeedStep (IST) BIOS support information
 						(struct ist_info)
 070/008		ALL	acpi_rsdp_addr		Physical address of ACPI RSDP table
+078/008		ALL	unaccepted_memory	Bitmap of unaccepted memory (1bit == 2M)
 080/010		ALL	hd0_info		hd0 disk parameter, OBSOLETE!!
 090/010		ALL	hd1_info		hd1 disk parameter, OBSOLETE!!
 0A0/010		ALL	sys_desc_table		System description table (struct sys_desc_table),
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 3dc5db651dd0..0ae221540dee 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,6 +107,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
 
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
new file mode 100644
index 000000000000..a848119e4455
--- /dev/null
+++ b/arch/x86/boot/compressed/mem.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../cpuflags.h"
+#include "bitmap.h"
+#include "error.h"
+#include "math.h"
+
+#define PMD_SHIFT	21
+#define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
+#define PMD_MASK	(~(PMD_SIZE - 1))
+
+static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-acceptance call goes here */
+	error("Cannot accept memory");
+}
+
+/*
+ * The accepted memory bitmap only works at PMD_SIZE granularity.  This
+ * function takes unaligned start/end addresses and either:
+ *  1. Accepts the memory immediately and in its entirety
+ *  2. Accepts unaligned parts, and marks *some* aligned part unaccepted
+ *
+ * The function will never reach the bitmap_set() with zero bits to set.
+ */
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
+{
+	/*
+	 * Ensure that at least one bit will be set in the bitmap by
+	 * immediately accepting all regions under 2*PMD_SIZE.  This is
+	 * imprecise and may immediately accept some areas that could
+	 * have been represented in the bitmap. But it results in simpler
+	 * code below.
+	 *
+	 * Consider case like this:
+	 *
+	 * | 4k | 2044k |    2048k   |
+	 * ^ 0x0        ^ 2MB        ^ 4MB
+	 *
+	 * Only the first 4k has been accepted. The 0MB->2MB region cannot be
+	 * represented in the bitmap. The 2MB->4MB region can be represented in
+	 * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
+	 * immediately accepted in its entirety.
+	 */
+	if (end - start < 2 * PMD_SIZE) {
+		__accept_memory(start, end);
+		return;
+	}
+
+	/*
+	 * No matter how the start and end are aligned, at least one unaccepted
+	 * PMD_SIZE area will remain to be marked in the bitmap.
+	 */
+
+	/* Immediately accept a <PMD_SIZE piece at the start: */
+	if (start & ~PMD_MASK) {
+		__accept_memory(start, round_up(start, PMD_SIZE));
+		start = round_up(start, PMD_SIZE);
+	}
+
+	/* Immediately accept a <PMD_SIZE piece at the end: */
+	if (end & ~PMD_MASK) {
+		__accept_memory(round_down(end, PMD_SIZE), end);
+		end = round_down(end, PMD_SIZE);
+	}
+
+	/*
+	 * 'start' and 'end' are now both PMD-aligned.
+	 * Record the range as being unaccepted:
+	 */
+	bitmap_set((unsigned long *)params->unaccepted_memory,
+		   start / PMD_SIZE, (end - start) / PMD_SIZE);
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..df0736d32858
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+struct boot_params;
+
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end);
+
+#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..630a54046af0 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -189,7 +189,7 @@ struct boot_params {
 	__u64  tboot_addr;				/* 0x058 */
 	struct ist_info ist_info;			/* 0x060 */
 	__u64 acpi_rsdp_addr;				/* 0x070 */
-	__u8  _pad3[8];					/* 0x078 */
+	__u64 unaccepted_memory;			/* 0x078 */
 	__u8  hd0_info[16];	/* obsolete! */		/* 0x080 */
 	__u8  hd1_info[16];	/* obsolete! */		/* 0x090 */
 	struct sys_desc_table sys_desc_table; /* obsolete! */	/* 0x0a0 */
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 6787ed8dfacf..8aa8adf0bcb5 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -314,6 +314,20 @@ config EFI_COCO_SECRET
 	  virt/coco/efi_secret module to access the secrets, which in turn
 	  allows userspace programs to access the injected secrets.
 
+config UNACCEPTED_MEMORY
+	bool
+	depends on EFI_STUB
+	help
+	   Some Virtual Machine platforms, such as Intel TDX, require
+	   some memory to be "accepted" by the guest before it can be used.
+	   This mechanism helps prevent malicious hosts from making changes
+	   to guest memory.
+
+	   UEFI specification v2.9 introduced the EFI_UNACCEPTED_MEMORY memory type.
+
+	   This option adds support for unaccepted memory and makes such memory
+	   usable by the kernel.
+
 config EFI_EMBEDDED_FIRMWARE
 	bool
 	select CRYPTO_LIB_SHA256
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index a46df5d1d094..f525144e22e4 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -777,6 +777,7 @@ static __initdata char memory_type_name[][13] = {
 	"MMIO Port",
 	"PAL Code",
 	"Persistent",
+	"Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index fff81843169c..27b9eed5883b 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -15,6 +15,7 @@
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/boot.h>
+#include <asm/unaccepted_memory.h>
 
 #include "efistub.h"
 
@@ -613,6 +614,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 			e820_type = E820_TYPE_PMEM;
 			break;
 
+		case EFI_UNACCEPTED_MEMORY:
+			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
+				efi_warn_once(
+"The system has unaccepted memory,  but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
+				continue;
+			}
+			e820_type = E820_TYPE_RAM;
+			process_unaccepted_memory(params, d->phys_addr,
+						  d->phys_addr + PAGE_SIZE * d->num_pages);
+			break;
 		default:
 			continue;
 		}
@@ -677,6 +688,60 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
 	return status;
 }
 
+static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
+					       __u32 nr_desc,
+					       struct efi_boot_memmap *map)
+{
+	unsigned long *mem = NULL;
+	u64 size, max_addr = 0;
+	efi_status_t status;
+	bool found = false;
+	int i;
+
+	/* Check if there's any unaccepted memory and find the max address */
+	for (i = 0; i < nr_desc; i++) {
+		efi_memory_desc_t *d;
+		unsigned long m = (unsigned long)map->map;
+
+		d = efi_early_memdesc_ptr(m, map->desc_size, i);
+		if (d->type == EFI_UNACCEPTED_MEMORY)
+			found = true;
+		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
+			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
+	}
+
+	if (!found) {
+		params->unaccepted_memory = 0;
+		return EFI_SUCCESS;
+	}
+
+	/*
+	 * If unaccepted memory is present, allocate a bitmap to track what
+	 * memory has to be accepted before access.
+	 *
+	 * One bit in the bitmap represents 2MiB in the address space:
+	 * A 4k bitmap can track 64GiB of physical address space.
+	 *
+	 * In the worst case scenario -- a huge hole in the middle of the
+	 * address space -- it needs 256MiB to handle 4PiB of the address
+	 * space.
+	 *
+	 * TODO: handle situation if params->unaccepted_memory is already set.
+	 * It's required to deal with kexec.
+	 *
+	 * The bitmap will be populated in setup_e820() according to the memory
+	 * map after efi_exit_boot_services().
+	 */
+	size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+	status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
+	if (status == EFI_SUCCESS) {
+		memset(mem, 0, size);
+		params->unaccepted_memory = (unsigned long)mem;
+	}
+
+	return status;
+}
+
 static efi_status_t allocate_e820(struct boot_params *params,
 				  struct setup_data **e820ext,
 				  u32 *e820ext_size)
@@ -697,6 +762,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
 	}
 
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
+		status = allocate_unaccepted_bitmap(params, nr_desc, map);
+
 	efi_bs_call(free_pool, map);
 	return status;
 }
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 7603fc58c47c..cfdcc165071e 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
 #define EFI_PERSISTENT_MEMORY		14
-#define EFI_MAX_MEMORY_TYPE		15
+#define EFI_UNACCEPTED_MEMORY		15
+#define EFI_MAX_MEMORY_TYPE		16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
-- 
2.38.0
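
As a sanity check of the bitmap sizing in allocate_unaccepted_bitmap()
above, here is a standalone userspace sketch of the same arithmetic
(illustration only, not part of the series; DIV_ROUND_UP is open-coded):

#include <inttypes.h>
#include <stdio.h>

#define PMD_SIZE	(1ULL << 21)	/* one bitmap bit covers 2MiB */
#define BITS_PER_BYTE	8

/* Bytes of bitmap needed to track physical addresses up to max_addr */
static uint64_t bitmap_bytes(uint64_t max_addr)
{
	/* DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE) */
	return (max_addr + PMD_SIZE * BITS_PER_BYTE - 1) /
	       (PMD_SIZE * BITS_PER_BYTE);
}

int main(void)
{
	/* One 4k page of bitmap covers 64GiB of address space */
	printf("%" PRIu64 "\n", bitmap_bytes(64ULL << 30));	/* 4096 */
	/* Worst case from the comment above: 4PiB needs 256MiB */
	printf("%" PRIu64 "\n", bitmap_bytes(4ULL << 50));	/* 268435456 */
	return 0;
}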


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 07/14] x86/boot/compressed: Handle unaccepted memory
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 08/14] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

The firmware will pre-accept the memory used to run the stub. But, the
stub is responsible for accepting the memory into which it decompresses
the main kernel. Accept memory just before decompression starts.

The stub is also responsible for choosing a physical address in which to
place the decompressed kernel image. The KASLR mechanism will randomize
this physical address. Since the unaccepted memory region is relatively
small, KASLR would be quite ineffective if it only used the pre-accepted
area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes across the
entire physical address space by also including EFI_UNACCEPTED_MEMORY.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile        |  2 +-
 arch/x86/boot/compressed/efi.h           |  1 +
 arch/x86/boot/compressed/kaslr.c         | 35 ++++++++++++++++--------
 arch/x86/boot/compressed/mem.c           | 18 ++++++++++++
 arch/x86/boot/compressed/misc.c          |  6 ++++
 arch/x86/boot/compressed/misc.h          |  6 ++++
 arch/x86/include/asm/unaccepted_memory.h |  2 ++
 7 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 0ae221540dee..a9937b50c37f 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,7 +107,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
-vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o
 
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
index 7db2f41b54cd..cf475243b6d5 100644
--- a/arch/x86/boot/compressed/efi.h
+++ b/arch/x86/boot/compressed/efi.h
@@ -32,6 +32,7 @@ typedef	struct {
 } efi_table_hdr_t;
 
 #define EFI_CONVENTIONAL_MEMORY		 7
+#define EFI_UNACCEPTED_MEMORY		15
 
 #define EFI_MEMORY_MORE_RELIABLE \
 				((u64)0x0000000000010000ULL)	/* higher reliability */
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 454757fbdfe5..749f0fe7e446 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -672,6 +672,28 @@ static bool process_mem_region(struct mem_vector *region,
 }
 
 #ifdef CONFIG_EFI
+
+/*
+ * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
+ * guaranteed to be free.
+ *
+ * It is more conservative in picking free memory than the EFI spec allows:
+ *
+ * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
+ * and thus available to place the kernel image into, but in practice there's
+ * firmware where using that memory leads to crashes.
+ */
+static inline bool memory_type_is_free(efi_memory_desc_t *md)
+{
+	if (md->type == EFI_CONVENTIONAL_MEMORY)
+		return true;
+
+	if (md->type == EFI_UNACCEPTED_MEMORY)
+		return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
+
+	return false;
+}
+
 /*
  * Returns true if we processed the EFI memmap, which we prefer over the E820
  * table if it is available.
@@ -716,18 +738,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
 	for (i = 0; i < nr_desc; i++) {
 		md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
 
-		/*
-		 * Here we are more conservative in picking free memory than
-		 * the EFI spec allows:
-		 *
-		 * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
-		 * free memory and thus available to place the kernel image into,
-		 * but in practice there's firmware where using that memory leads
-		 * to crashes.
-		 *
-		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
-		 */
-		if (md->type != EFI_CONVENTIONAL_MEMORY)
+		if (!memory_type_is_free(md))
 			continue;
 
 		if (efi_soft_reserve_enabled() &&
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index a848119e4455..626e4f10ba2c 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -3,12 +3,15 @@
 #include "../cpuflags.h"
 #include "bitmap.h"
 #include "error.h"
+#include "find.h"
 #include "math.h"
 
 #define PMD_SHIFT	21
 #define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
 #define PMD_MASK	(~(PMD_SIZE - 1))
 
+extern struct boot_params *boot_params;
+
 static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
@@ -71,3 +74,18 @@ void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
 	bitmap_set((unsigned long *)params->unaccepted_memory,
 		   start / PMD_SIZE, (end - start) / PMD_SIZE);
 }
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long range_start, range_end;
+	unsigned long *bitmap, bitmap_size;
+
+	bitmap = (unsigned long *)boot_params->unaccepted_memory;
+	range_start = start / PMD_SIZE;
+	bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
+
+	for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
+		__accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
+		bitmap_clear(bitmap, range_start, range_end - range_start);
+	}
+}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index cf690d8712f4..c41a87b0ec06 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -454,6 +454,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
+
+	if (boot_params->unaccepted_memory) {
+		debug_putstr("Accepting memory... ");
+		accept_memory(__pa(output), __pa(output) + needed_size);
+	}
+
 	__decompress(input_data, input_len, NULL, NULL, output, output_len,
 			NULL, error);
 	parse_elf(output);
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index d37d612a6390..44337a1ac23c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -245,4 +245,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
 }
 #endif /* CONFIG_EFI */
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+void accept_memory(phys_addr_t start, phys_addr_t end);
+#else
+static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
+#endif
+
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index df0736d32858..41fbfc798100 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,4 +7,6 @@ struct boot_params;
 
 void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end);
 
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif
-- 
2.38.0
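
For readers unfamiliar with for_each_set_bitrange_from(): the
accept_memory() loop above visits each maximal run of set bits and maps
bit indices back to physical ranges. A standalone sketch of the same
idea (a toy bitmap walk, not the kernel implementation):

#include <stdio.h>

#define PMD_SIZE	(1UL << 21)

static int test_bit_toy(const unsigned char *bm, unsigned long i)
{
	return bm[i / 8] & (1 << (i % 8));
}

/* Each set bit i stands for [i * PMD_SIZE, (i + 1) * PMD_SIZE) */
static void walk_unaccepted(const unsigned char *bm, unsigned long nbits)
{
	unsigned long i = 0;

	while (i < nbits) {
		unsigned long start, end;

		while (i < nbits && !test_bit_toy(bm, i))
			i++;
		if (i == nbits)
			break;
		start = i;
		while (i < nbits && test_bit_toy(bm, i))
			i++;
		end = i;
		printf("accept [%#lx, %#lx)\n",
		       start * PMD_SIZE, end * PMD_SIZE);
	}
}

int main(void)
{
	unsigned char bm[] = { 0x0e };	/* bits 1-3 set */

	walk_unaccepted(bm, 8);	/* prints: accept [0x200000, 0x800000) */
	return 0;
}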


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 08/14] x86/mm: Reserve unaccepted memory bitmap
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 07/14] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 09/14] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Mike Rapoport

A given page of memory can only be accepted once. The kernel has to
accept memory both in the early decompression stage and during normal
runtime.

A bitmap is used to communicate the acceptance state of each page
between the decompression stage and normal runtime.

boot_params is used to communicate the location of the bitmap
throughout the boot. The bitmap is allocated and initially populated in
the EFI stub. The decompression stage accepts the pages required for
the kernel/initrd and marks them accordingly in the bitmap. The main
kernel picks up the bitmap from the same boot_params and uses it to
determine what has to be accepted on allocation.

In the runtime kernel, reserve the bitmap's memory to ensure nothing
overwrites it.

The size of the bitmap is determined with e820__end_of_ram_pfn(), which
relies on setup_e820() marking unaccepted memory as E820_TYPE_RAM.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/kernel/e820.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9dac24680ff8..62068956bb76 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,6 +1316,23 @@ void __init e820__memblock_setup(void)
 	int i;
 	u64 end;
 
+	/*
+	 * Mark unaccepted memory bitmap reserved.
+	 *
+	 * This kind of reservation is usually done from early_reserve_memory(),
+	 * but early_reserve_memory() is called before e820__memory_setup(), so
+	 * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
+	 * used to get correct RAM size.
+	 */
+	if (boot_params.unaccepted_memory) {
+		unsigned long size;
+
+		/* One bit per 2MB */
+		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
+				    PMD_SIZE * BITS_PER_BYTE);
+		memblock_reserve(boot_params.unaccepted_memory, size);
+	}
+
 	/*
 	 * The bootstrap memblock region count maximum is 128 entries
 	 * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
-- 
2.38.0
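
To put a number on the reservation above: assuming (for illustration) a
4GiB guest with no memory holes, e820__end_of_ram_pfn() * PAGE_SIZE is
4GiB and the reserved bitmap is only 256 bytes:

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	/* One bit per 2MiB: 4GiB / (2MiB * 8 bits per byte) */
	unsigned long long size = DIV_ROUND_UP(4ULL << 30, (1ULL << 21) * 8);

	printf("%llu\n", size);	/* 256 */
	return 0;
}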


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 09/14] x86/mm: Provide helpers for unaccepted memory
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 08/14] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed.

 - range_contains_unaccepted_memory() checks if anything within the
   range requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/page.h              |  3 ++
 arch/x86/include/asm/unaccepted_memory.h |  4 ++
 arch/x86/mm/Makefile                     |  2 +
 arch/x86/mm/unaccepted_memory.c          | 61 ++++++++++++++++++++++++
 4 files changed, 70 insertions(+)
 create mode 100644 arch/x86/mm/unaccepted_memory.c

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 9cc82f305f4b..df4ec3a988dc 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,9 @@
 struct page;
 
 #include <linux/range.h>
+
+#include <asm/unaccepted_memory.h>
+
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index 41fbfc798100..89fc91c61560 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,6 +7,10 @@ struct boot_params;
 
 void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end);
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
 void accept_memory(phys_addr_t start, phys_addr_t end);
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
 
 #endif
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index c80febc44cd2..b0ef1755e5c8 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -67,3 +67,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
+obj-$(CONFIG_UNACCEPTED_MEMORY)	+= unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
new file mode 100644
index 000000000000..1df918b21469
--- /dev/null
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/spinlock.h>
+
+#include <asm/io.h>
+#include <asm/setup.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long range_start, range_end;
+	unsigned long *bitmap;
+	unsigned long flags;
+
+	if (!boot_params.unaccepted_memory)
+		return;
+
+	bitmap = __va(boot_params.unaccepted_memory);
+	range_start = start / PMD_SIZE;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for_each_set_bitrange_from(range_start, range_end, bitmap,
+				   DIV_ROUND_UP(end, PMD_SIZE)) {
+		unsigned long len = range_end - range_start;
+
+		/* Platform-specific memory-acceptance call goes here */
+		panic("Cannot accept memory: unknown platform\n");
+		bitmap_clear(bitmap, range_start, len);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *bitmap;
+	unsigned long flags;
+	bool ret = false;
+
+	if (!boot_params.unaccepted_memory)
+		return false;
+
+	bitmap = __va(boot_params.unaccepted_memory);
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	while (start < end) {
+		if (test_bit(start / PMD_SIZE, bitmap)) {
+			ret = true;
+			break;
+		}
+
+		start += PMD_SIZE;
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return ret;
+}
-- 
2.38.0
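
A sketch of how a caller might combine the two helpers
(ensure_range_accepted() is a hypothetical name used for illustration;
the real core-mm hooks are wired up in patch 02/14):

/* Hypothetical wrapper: make [paddr, paddr + size) safe to touch */
static void ensure_range_accepted(phys_addr_t paddr, unsigned long size)
{
	if (range_contains_unaccepted_memory(paddr, paddr + size))
		accept_memory(paddr, paddr + size);
}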


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 09/14] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 11/14] x86: Disable kexec if system has " Kirill A. Shutemov
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Dave Hansen

load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.

But, this approach does not work for unaccepted memory. For TDX, a load
from unaccepted memory will not lead to a recoverable exception within
the guest. The guest will exit to the VMM where the only recourse is to
terminate the guest.

There are three parts to fix this issue and comprehensively avoid access
to unaccepted memory. Together these ensure that an extra "guard" page
is accepted in addition to the memory that needs to be used.

1. Implicitly extend the range_contains_unaccepted_memory(start, end)
   checks up to end+2M if 'end' is aligned on a 2M boundary. It may
   require checking a 2M chunk beyond the end of RAM. The bitmap allocation is
   modified to accommodate this.
2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
   aligned on a 2M boundary.
3. Set PageUnaccepted() on both memory that itself needs to be accepted
   *and* memory where the next page needs to be accepted. Essentially,
   make PageUnaccepted(page) a marker for whether work needs to be done
   to make 'page' usable. That work might include accepting pages in
   addition to 'page' itself.

Side note: This leads to something strange. Pages which were accepted
	   at boot, marked by the firmware as accepted and will never
	   _need_ to be accepted might have PageUnaccepted() set on
	   them. PageUnaccepted(page) is a cue to ensure that the next
	   page is accepted before 'page' can be used.

This is an actual, real-world problem which was discovered during TDX
testing.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/mm/unaccepted_memory.c         | 39 +++++++++++++++++++++++++
 drivers/firmware/efi/libstub/x86-stub.c |  7 +++++
 2 files changed, 46 insertions(+)

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 1df918b21469..a0a58486eb74 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	bitmap = __va(boot_params.unaccepted_memory);
 	range_start = start / PMD_SIZE;
 
+	/*
+	 * load_unaligned_zeropad() can lead to unwanted loads across page
+	 * boundaries. The unwanted loads are typically harmless. But, they
+	 * might be made to totally unrelated or even unmapped memory.
+	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+	 * #VE) to recover from these unwanted loads.
+	 *
+	 * But, this approach does not work for unaccepted memory. For TDX, a
+	 * load from unaccepted memory will not lead to a recoverable exception
+	 * within the guest. The guest will exit to the VMM where the only
+	 * recourse is to terminate the guest.
+	 *
+	 * There are three parts to fix this issue and comprehensively avoid
+	 * access to unaccepted memory. Together these ensure that an extra
+	 * "guard" page is accepted in addition to the memory that needs to be
+	 * used:
+	 *
+	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+	 *    checks up to end+2M if 'end' is aligned on a 2M boundary.
+	 *
+	 * 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
+	 *    aligned on a 2M boundary. (immediately following this comment)
+	 *
+	 * 3. Set PageUnaccepted() on both memory that itself needs to be
+	 *    accepted *and* memory where the next page needs to be accepted.
+	 *    Essentially, make PageUnaccepted(page) a marker for whether work
+	 *    needs to be done to make 'page' usable. That work might include
+	 *    accepting pages in addition to 'page' itself.
+	 */
+	if (!(end % PMD_SIZE))
+		end += PMD_SIZE;
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	for_each_set_bitrange_from(range_start, range_end, bitmap,
 				   DIV_ROUND_UP(end, PMD_SIZE)) {
@@ -46,6 +78,13 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
 	bitmap = __va(boot_params.unaccepted_memory);
 
+	/*
+	 * Also consider the unaccepted state of the *next* page. See fix #1 in
+	 * the comment on load_unaligned_zeropad() in accept_memory().
+	 */
+	if (!(end % PMD_SIZE))
+		end += PMD_SIZE;
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	while (start < end) {
 		if (test_bit(start / PMD_SIZE, bitmap)) {
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 27b9eed5883b..f375ab784c78 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -715,6 +715,13 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
 		return EFI_SUCCESS;
 	}
 
+	/*
+	 * range_contains_unaccepted_memory() may need to check one 2M chunk
+	 * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
+	 * sure that the bitmap is large enough to handle it.
+	 */
+	max_addr += PMD_SIZE;
+
 	/*
 	 * If unaccepted memory is present, allocate a bitmap to track what
 	 * memory has to be accepted before access.
-- 
2.38.0
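
The end-extension rule from fixes #1 and #2 is easy to check by hand. A
standalone sketch (PMD_SIZE is 2MiB, as elsewhere in the series):

#include <assert.h>
#include <stdint.h>

#define PMD_SIZE	(1ULL << 21)

/* If 'end' is 2M-aligned, also cover the next 2M "guard" chunk so a
 * load_unaligned_zeropad() crossing 'end' cannot hit unaccepted memory. */
static uint64_t guard_extend(uint64_t end)
{
	if (!(end % PMD_SIZE))
		end += PMD_SIZE;
	return end;
}

int main(void)
{
	assert(guard_extend(0x400000) == 0x600000);	/* aligned: extended */
	assert(guard_extend(0x401000) == 0x401000);	/* unaligned: as-is */
	return 0;
}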


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 11/14] x86: Disable kexec if system has unaccepted memory
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

On kexec, the target kernel has to know what memory has been accepted.
Information in the EFI map is out of date and cannot be used.

boot_params.unaccepted_memory can be used to pass the bitmap between two
kernels on kexec, but the use-case is not yet implemented.

Disable kexec on machines with unaccepted memory for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/kexec.h    |  5 +++++
 arch/x86/mm/unaccepted_memory.c | 16 ++++++++++++++++
 include/linux/kexec.h           |  7 +++++++
 kernel/kexec.c                  |  4 ++++
 kernel/kexec_file.c             |  4 ++++
 5 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index a3760ca796aa..87abab578154 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -189,6 +189,11 @@ extern void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages);
 void arch_kexec_protect_crashkres(void);
 #define arch_kexec_protect_crashkres arch_kexec_protect_crashkres
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+int arch_kexec_load(void);
+#define arch_kexec_load arch_kexec_load
+#endif
+
 void arch_kexec_unprotect_crashkres(void);
 #define arch_kexec_unprotect_crashkres arch_kexec_unprotect_crashkres
 
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index a0a58486eb74..1745e6a65024 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/kexec.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/pfn.h>
@@ -98,3 +99,18 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
 	return ret;
 }
+
+#ifdef CONFIG_KEXEC_CORE
+int arch_kexec_load(void)
+{
+	if (!boot_params.unaccepted_memory)
+		return 0;
+
+	/*
+	 * TODO: Information on memory acceptance status has to be communicated
+	 * between kernels.
+	 */
+	pr_warn_once("Disable kexec: not yet supported on systems with unaccepted memory\n");
+	return -EOPNOTSUPP;
+}
+#endif
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 41a686996aaa..6b75051d5271 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -444,6 +444,13 @@ static inline void arch_kexec_protect_crashkres(void) { }
 static inline void arch_kexec_unprotect_crashkres(void) { }
 #endif
 
+#ifndef arch_kexec_load
+static inline int arch_kexec_load(void)
+{
+	return 0;
+}
+#endif
+
 #ifndef page_to_boot_pfn
 static inline unsigned long page_to_boot_pfn(struct page *page)
 {
diff --git a/kernel/kexec.c b/kernel/kexec.c
index cb8e6e6f983c..65dff44b487f 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -192,6 +192,10 @@ static inline int kexec_load_check(unsigned long nr_segments,
 {
 	int result;
 
+	result = arch_kexec_load();
+	if (result)
+		return result;
+
 	/* We only trust the superuser with rebooting the system. */
 	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
 		return -EPERM;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 45637511e0de..8f1454c3776a 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -329,6 +329,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	int ret = 0, i;
 	struct kimage **dest_image, *image;
 
+	ret = arch_kexec_load();
+	if (ret)
+		return ret;
+
 	/* We only trust the superuser with rebooting the system. */
 	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
 		return -EPERM;
-- 
2.38.0
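
Note the ordering: the arch veto runs before the CAP_SYS_BOOT check, so
kexec is refused on affected systems regardless of privilege. A
condensed sketch of the resulting flow (simplified from the two call
sites above, not verbatim kernel code):

static int kexec_load_check_sketch(void)
{
	int result;

	/* Arch veto first: -EOPNOTSUPP on unaccepted-memory systems */
	result = arch_kexec_load();
	if (result)
		return result;

	/* Only then the usual privilege check */
	if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
		return -EPERM;

	return 0;
}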


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 11/14] x86: Disable kexec if system has " Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Dave Hansen

Memory acceptance requires a hypercall and one or multiple module calls.

Make the helpers for these calls available in the boot stub. It has to
accept the memory where the kernel image and initrd are placed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c           | 27 ------------------
 arch/x86/include/asm/shared/tdx.h | 46 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/tdx.h        | 19 -------------
 3 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index cfd4c95b9f04..12c14affa5f2 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -14,15 +14,6 @@
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
 
-/* TDX module Call Leaf IDs */
-#define TDX_GET_INFO			1
-#define TDX_GET_VEINFO			3
-#define TDX_GET_REPORT			4
-#define TDX_ACCEPT_PAGE			6
-
-/* TDX hypercall Leaf IDs */
-#define TDVMCALL_MAP_GPA		0x10001
-
 /* MMIO direction */
 #define EPT_READ	0
 #define EPT_WRITE	1
@@ -45,24 +36,6 @@
 
 #define TDREPORT_SUBTYPE_0	0
 
-/*
- * Wrapper for standard use of __tdx_hypercall with no output aside from
- * return code.
- */
-static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
-{
-	struct tdx_hypercall_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = fn,
-		.r12 = r12,
-		.r13 = r13,
-		.r14 = r14,
-		.r15 = r15,
-	};
-
-	return __tdx_hypercall(&args, 0);
-}
-
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void)
 {
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index e53f26228fbb..c5f12b90ef70 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -13,6 +13,15 @@
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO			1
+#define TDX_GET_VEINFO			3
+#define TDX_GET_REPORT			4
+#define TDX_ACCEPT_PAGE			6
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA		0x10001
+
 #ifndef __ASSEMBLY__
 
 /*
@@ -33,8 +42,45 @@ struct tdx_hypercall_args {
 /* Used to request services from the VMM */
 u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
 
+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = fn,
+		.r12 = r12,
+		.r13 = r13,
+		.r14 = r14,
+		.r15 = r15,
+	};
+
+	return __tdx_hypercall(&args, 0);
+}
+
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void);
 
+/*
+ * Used in __tdx_module_call() to gather the output registers' values of the
+ * TDCALL instruction when requesting services from the TDX module. This is a
+ * software only structure and not part of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..234197ec17e4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,21 +20,6 @@
 
 #ifndef __ASSEMBLY__
 
-/*
- * Used to gather the output registers values of the TDCALL and SEAMCALL
- * instructions when requesting services from the TDX module.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_module_output {
-	u64 rcx;
-	u64 rdx;
-	u64 r8;
-	u64 r9;
-	u64 r10;
-	u64 r11;
-};
-
 /*
  * Used by the #VE exception handler to gather the #VE exception
  * info from the TDX module. This is a software only structure
@@ -55,10 +40,6 @@ struct ve_info {
 
 void __init tdx_early_init(void);
 
-/* Used to communicate with the TDX module */
-u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-		      struct tdx_module_output *out);
-
 void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
-- 
2.38.0
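
With the wrapper shared, the boot stub can issue hypercalls exactly like
the runtime kernel. A usage sketch (tdx_map_gpa() is an illustrative
name; it mirrors the MapGPA call made later in this series):

/* Ask the VMM to convert the mapping of [start, end); true on success */
static bool tdx_map_gpa(u64 start, u64 end)
{
	return _tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0) == 0;
}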


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 13/14] x86/tdx: Refactor try_accept_one()
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  2022-12-07  1:49 ` [PATCHv8 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov, Dave Hansen

Rework try_accept_one() to return the accepted size instead of modifying
'start' inside the helper. It makes 'start' an in-only argument and
streamlines the code on the caller side.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 12c14affa5f2..cf6d9a0968d8 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -674,18 +674,18 @@ static bool tdx_cache_flush_required(void)
 	return true;
 }
 
-static bool try_accept_one(phys_addr_t *start, unsigned long len,
-			  enum pg_level pg_level)
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+				    enum pg_level pg_level)
 {
 	unsigned long accept_size = page_level_size(pg_level);
 	u64 tdcall_rcx;
 	u8 page_size;
 
-	if (!IS_ALIGNED(*start, accept_size))
-		return false;
+	if (!IS_ALIGNED(start, accept_size))
+		return 0;
 
 	if (len < accept_size)
-		return false;
+		return 0;
 
 	/*
 	 * Pass the page physical address to the TDX module to accept the
@@ -704,15 +704,14 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
 		page_size = 2;
 		break;
 	default:
-		return false;
+		return 0;
 	}
 
-	tdcall_rcx = *start | page_size;
+	tdcall_rcx = start | page_size;
 	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
-		return false;
+		return 0;
 
-	*start += accept_size;
-	return true;
+	return accept_size;
 }
 
 /*
@@ -749,21 +748,22 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	 */
 	while (start < end) {
 		unsigned long len = end - start;
+		unsigned long accept_size;
 
 		/*
 		 * Try larger accepts first. It gives chance to VMM to keep
-		 * 1G/2M SEPT entries where possible and speeds up process by
-		 * cutting number of hypercalls (if successful).
+		 * 1G/2M Secure EPT entries where possible and speeds up
+		 * process by cutting number of hypercalls (if successful).
 		 */
 
-		if (try_accept_one(&start, len, PG_LEVEL_1G))
-			continue;
-
-		if (try_accept_one(&start, len, PG_LEVEL_2M))
-			continue;
-
-		if (!try_accept_one(&start, len, PG_LEVEL_4K))
+		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+		if (!accept_size)
 			return false;
+		start += accept_size;
 	}
 
 	return true;
-- 
2.38.0
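
The return-size convention also lets the largest-first fallback be
written as a loop over page levels. An equivalent formulation, as a
sketch (accept_at() is an illustrative name, not what the patch does
verbatim):

static unsigned long accept_at(phys_addr_t start, unsigned long len)
{
	/* Largest page size first, to minimize hypercalls */
	static const enum pg_level levels[] = {
		PG_LEVEL_1G, PG_LEVEL_2M, PG_LEVEL_4K,
	};
	unsigned long accept_size;
	int i;

	for (i = 0; i < ARRAY_SIZE(levels); i++) {
		accept_size = try_accept_one(start, len, levels[i]);
		if (accept_size)
			return accept_size;
	}

	return 0;	/* caller treats this as failure */
}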


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCHv8 14/14] x86/tdx: Add unaccepted memory support
  2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2022-12-07  1:49 ` [PATCHv8 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
@ 2022-12-07  1:49 ` Kirill A. Shutemov
  13 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-07  1:49 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Dave Hansen,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel,
	Kirill A. Shutemov

Hook up TDX-specific code to accept memory.

Accepting the memory is the same process as converting memory from
shared to private: the kernel notifies the VMM with the MAP_GPA
hypercall and then accepts pages with the ACCEPT_PAGE module call.

The implementation in the core kernel uses tdx_enc_status_changed(). It
is already used for converting memory to shared and back for I/O
transactions.

The boot stub provides its own implementation of tdx_accept_memory().
It is similar in structure to tdx_enc_status_changed(), but only cares
about converting memory to private.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                      |  2 +
 arch/x86/boot/compressed/Makefile     |  2 +-
 arch/x86/boot/compressed/error.c      | 19 ++++++
 arch/x86/boot/compressed/error.h      |  1 +
 arch/x86/boot/compressed/mem.c        | 33 +++++++++-
 arch/x86/boot/compressed/tdx-shared.c |  2 +
 arch/x86/boot/compressed/tdx.c        | 39 +++++++++++
 arch/x86/coco/tdx/Makefile            |  2 +-
 arch/x86/coco/tdx/tdx-shared.c        | 95 +++++++++++++++++++++++++++
 arch/x86/coco/tdx/tdx.c               | 86 +-----------------------
 arch/x86/include/asm/shared/tdx.h     |  2 +
 arch/x86/include/asm/tdx.h            |  2 +
 arch/x86/mm/unaccepted_memory.c       |  9 ++-
 13 files changed, 206 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx-shared.c
 create mode 100644 arch/x86/coco/tdx/tdx-shared.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 947937d327c6..bc031fd1d95e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -888,6 +888,8 @@ config INTEL_TDX_GUEST
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
 	select X86_MCE
+	select UNACCEPTED_MEMORY
+	select EFI_STUB
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index a9937b50c37f..38c42c2b4349 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -106,7 +106,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
-vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdx-shared.o $(obj)/tdcall.o
 vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o
 
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
diff --git a/arch/x86/boot/compressed/error.c b/arch/x86/boot/compressed/error.c
index c881878e56d3..5313c5cb2b80 100644
--- a/arch/x86/boot/compressed/error.c
+++ b/arch/x86/boot/compressed/error.c
@@ -22,3 +22,22 @@ void error(char *m)
 	while (1)
 		asm("hlt");
 }
+
+/* EFI libstub provides vsnprintf() */
+#ifdef CONFIG_EFI_STUB
+void panic(const char *fmt, ...)
+{
+	static char buf[1024];
+	va_list args;
+	int len;
+
+	va_start(args, fmt);
+	len = vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	if (len && buf[len - 1] == '\n')
+		buf[len - 1] = '\0';
+
+	error(buf);
+}
+#endif
diff --git a/arch/x86/boot/compressed/error.h b/arch/x86/boot/compressed/error.h
index 1de5821184f1..86fe33b93715 100644
--- a/arch/x86/boot/compressed/error.h
+++ b/arch/x86/boot/compressed/error.h
@@ -6,5 +6,6 @@
 
 void warn(char *m);
 void error(char *m) __noreturn;
+void panic(const char *fmt, ...) __noreturn __cold;
 
 #endif /* BOOT_COMPRESSED_ERROR_H */
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 626e4f10ba2c..23a84c46aa9b 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -5,6 +5,8 @@
 #include "error.h"
 #include "find.h"
 #include "math.h"
+#include "tdx.h"
+#include <asm/shared/tdx.h>
 
 #define PMD_SHIFT	21
 #define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
@@ -12,10 +14,39 @@
 
 extern struct boot_params *boot_params;
 
+/*
+ * accept_memory() and process_unaccepted_memory() are called from the EFI
+ * stub, which runs before the decompressor and its early_tdx_detect().
+ *
+ * Detect TDX directly from these early users.
+ */
+static bool early_is_tdx_guest(void)
+{
+	static bool once;
+	static bool is_tdx;
+
+	if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
+		return false;
+
+	if (!once) {
+		u32 eax, sig[3];
+
+		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
+			    &sig[0], &sig[2],  &sig[1]);
+		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
+		once = true;
+	}
+
+	return is_tdx;
+}
+
 static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
-	error("Cannot accept memory");
+	if (early_is_tdx_guest())
+		tdx_accept_memory(start, end);
+	else
+		error("Cannot accept memory: unknown platform\n");
 }
 
 /*
diff --git a/arch/x86/boot/compressed/tdx-shared.c b/arch/x86/boot/compressed/tdx-shared.c
new file mode 100644
index 000000000000..5ac43762fe13
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx-shared.c
@@ -0,0 +1,2 @@
+#include "error.h"
+#include "../../coco/tdx/tdx-shared.c"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 918a7606f53c..de1d4a87418d 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -3,12 +3,17 @@
 #include "../cpuflags.h"
 #include "../string.h"
 #include "../io.h"
+#include "align.h"
 #include "error.h"
+#include "pgtable_types.h"
 
 #include <vdso/limits.h>
 #include <uapi/asm/vmx.h>
 
 #include <asm/shared/tdx.h>
+#include <asm/page_types.h>
+
+static u64 cc_mask;
 
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void)
@@ -16,6 +21,38 @@ void __tdx_hypercall_failed(void)
 	error("TDVMCALL failed. TDX module bug?");
 }
 
+static u64 get_cc_mask(void)
+{
+	struct tdx_module_output out;
+	unsigned int gpa_width;
+
+	/*
+	 * TDINFO TDX module call is used to get the TD execution environment
+	 * information like GPA width, number of available vcpus, debug mode
+	 * information, etc. More details about the ABI can be found in TDX
+	 * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+	 * [TDG.VP.INFO].
+	 *
+	 * The GPA width that comes out of this call is critical. TDX guests
+	 * can not meaningfully run without it.
+	 */
+	if (__tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out))
+		error("TDCALL GET_INFO failed (Buggy TDX module!)\n");
+
+	gpa_width = out.rcx & GENMASK(5, 0);
+
+	/*
+	 * The highest bit of a guest physical address is the "sharing" bit.
+	 * Set it for shared pages and clear it for private pages.
+	 */
+	return BIT_ULL(gpa_width - 1);
+}
+
+u64 cc_mkdec(u64 val)
+{
+	return val | cc_mask;
+}
+
 static inline unsigned int tdx_io_in(int size, u16 port)
 {
 	struct tdx_hypercall_args args = {
@@ -70,6 +107,8 @@ void early_tdx_detect(void)
 	if (memcmp(TDX_IDENT, sig, sizeof(sig)))
 		return;
 
+	cc_mask = get_cc_mask();
+
 	/* Use hypercalls instead of I/O instructions */
 	pio_ops.f_inb  = tdx_inb;
 	pio_ops.f_outb = tdx_outb;
diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..2c7dcbf1458b 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdx-shared.o tdcall.o
diff --git a/arch/x86/coco/tdx/tdx-shared.c b/arch/x86/coco/tdx/tdx-shared.c
new file mode 100644
index 000000000000..ee74f7bbe806
--- /dev/null
+++ b/arch/x86/coco/tdx/tdx-shared.c
@@ -0,0 +1,95 @@
+#include <asm/tdx.h>
+#include <asm/pgtable.h>
+
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+				    enum pg_level pg_level)
+{
+	unsigned long accept_size = page_level_size(pg_level);
+	u64 tdcall_rcx;
+	u8 page_size;
+
+	if (!IS_ALIGNED(start, accept_size))
+		return 0;
+
+	if (len < accept_size)
+		return 0;
+
+	/*
+	 * Pass the page physical address to the TDX module to accept the
+	 * pending, private page.
+	 *
+	 * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+	 */
+	switch (pg_level) {
+	case PG_LEVEL_4K:
+		page_size = 0;
+		break;
+	case PG_LEVEL_2M:
+		page_size = 1;
+		break;
+	case PG_LEVEL_1G:
+		page_size = 2;
+		break;
+	default:
+		return 0;
+	}
+
+	tdcall_rcx = start | page_size;
+	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
+		return 0;
+
+	return accept_size;
+}
+
+bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end, bool enc)
+{
+	if (!enc) {
+		/* Set the shared (decrypted) bits: */
+		start |= cc_mkdec(0);
+		end   |= cc_mkdec(0);
+	}
+
+	/*
+	 * Notify the VMM about page mapping conversion. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
+	 * section "TDG.VP.VMCALL<MapGPA>"
+	 */
+	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+		return false;
+
+	/* private->shared conversion requires only MapGPA call */
+	if (!enc)
+		return true;
+
+	/*
+	 * For shared->private conversion, accept the page using
+	 * TDX_ACCEPT_PAGE TDX module call.
+	 */
+	while (start < end) {
+		unsigned long len = end - start;
+		unsigned long accept_size;
+
+		/*
+		 * Try larger accepts first. It gives chance to VMM to keep
+		 * 1G/2M Secure EPT entries where possible and speeds up
+		 * process by cutting number of hypercalls (if successful).
+		 */
+
+		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+		if (!accept_size)
+			return false;
+		start += accept_size;
+	}
+
+	return true;
+}
+
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	if (!tdx_enc_status_changed_phys(start, end, true))
+		panic("Accepting memory failed: %#llx-%#llx\n", start, end);
+}
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index cf6d9a0968d8..0aa95fe29045 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -674,46 +674,6 @@ static bool tdx_cache_flush_required(void)
 	return true;
 }
 
-static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
-				    enum pg_level pg_level)
-{
-	unsigned long accept_size = page_level_size(pg_level);
-	u64 tdcall_rcx;
-	u8 page_size;
-
-	if (!IS_ALIGNED(start, accept_size))
-		return 0;
-
-	if (len < accept_size)
-		return 0;
-
-	/*
-	 * Pass the page physical address to the TDX module to accept the
-	 * pending, private page.
-	 *
-	 * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
-	 */
-	switch (pg_level) {
-	case PG_LEVEL_4K:
-		page_size = 0;
-		break;
-	case PG_LEVEL_2M:
-		page_size = 1;
-		break;
-	case PG_LEVEL_1G:
-		page_size = 2;
-		break;
-	default:
-		return 0;
-	}
-
-	tdcall_rcx = start | page_size;
-	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
-		return 0;
-
-	return accept_size;
-}
-
 /*
  * Inform the VMM of the guest's intent for this physical page: shared with
  * the VMM or private to the guest.  The VMM is expected to change its mapping
@@ -722,51 +682,9 @@ static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
 static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 {
 	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
-
-	if (!enc) {
-		/* Set the shared (decrypted) bits: */
-		start |= cc_mkdec(0);
-		end   |= cc_mkdec(0);
-	}
-
-	/*
-	 * Notify the VMM about page mapping conversion. More info about ABI
-	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
-	 * section "TDG.VP.VMCALL<MapGPA>"
-	 */
-	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
-		return false;
-
-	/* private->shared conversion  requires only MapGPA call */
-	if (!enc)
-		return true;
+	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
 
-	/*
-	 * For shared->private conversion, accept the page using
-	 * TDX_ACCEPT_PAGE TDX module call.
-	 */
-	while (start < end) {
-		unsigned long len = end - start;
-		unsigned long accept_size;
-
-		/*
-		 * Try larger accepts first. It gives chance to VMM to keep
-		 * 1G/2M Secure EPT entries where possible and speeds up
-		 * process by cutting number of hypercalls (if successful).
-		 */
-
-		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
-		if (!accept_size)
-			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
-		if (!accept_size)
-			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
-		if (!accept_size)
-			return false;
-		start += accept_size;
-	}
-
-	return true;
+	return tdx_enc_status_changed_phys(start, end, enc);
 }
 
 void __init tdx_early_init(void)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index c5f12b90ef70..a21685fd7c4f 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -82,5 +82,7 @@ struct tdx_module_output {
 u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		      struct tdx_module_output *out);
 
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 234197ec17e4..3a7340ad9a4b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -50,6 +50,8 @@ bool tdx_early_handle_ve(struct pt_regs *regs);
 
 int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);
 
+bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end, bool enc);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 1745e6a65024..45706c684ca5 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -7,6 +7,7 @@
 
 #include <asm/io.h>
 #include <asm/setup.h>
+#include <asm/shared/tdx.h>
 #include <asm/unaccepted_memory.h>
 
 /* Protects unaccepted memory bitmap */
@@ -62,7 +63,13 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 		unsigned long len = range_end - range_start;
 
 		/* Platform-specific memory-acceptance call goes here */
-		panic("Cannot accept memory: unknown platform\n");
+		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+			tdx_accept_memory(range_start * PMD_SIZE,
+					  range_end * PMD_SIZE);
+		} else {
+			panic("Cannot accept memory: unknown platform\n");
+		}
+
 		bitmap_clear(bitmap, range_start, len);
 	}
 	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
-- 
2.38.0
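
A minimal standalone sketch, not part of the patch, of how a pair of bitmap
bit indices from the accept_memory() hunk above turns into the physical
range handed to tdx_accept_memory(); the bit values here are made up for
illustration:

	#include <stdio.h>

	#define PMD_SIZE (2UL << 20)	/* one bitmap bit covers 2MiB */

	int main(void)
	{
		/* Say bits 16..19 of the bitmap are still set... */
		unsigned long range_start = 16, range_end = 20;

		/* ...then the guest must accept physical [32MiB, 40MiB). */
		printf("accept %#lx-%#lx\n",
		       range_start * PMD_SIZE, range_end * PMD_SIZE);
		return 0;
	}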


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-07  1:49 ` [PATCHv8 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
@ 2022-12-09 18:10   ` Vlastimil Babka
  2022-12-09 19:26     ` Kirill A. Shutemov
  2022-12-26 12:23   ` Borislav Petkov
  2023-01-16 13:04   ` Mel Gorman
  2 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2022-12-09 18:10 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 12/7/22 02:49, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways kernel can deal with unaccepted memory:
> 
>  1. Accept all the memory during the boot. It is easy to implement and
>     it doesn't have runtime cost once the system is booted. The downside
>     is very long boot time.
> 
>     Accept can be parallelized to multiple CPUs to keep it manageable
>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>     memory bandwidth and does not scale beyond the point.
> 
>  2. Accept a block of memory on the first use. It requires more
>     infrastructure and changes in page allocator to make it work, but
>     it provides good boot time.
> 
>     On-demand memory accept means latency spikes every time kernel steps
>     onto a new memory block. The spikes will go away once workload data
>     set size gets stabilized or all memory gets accepted.
> 
>  3. Accept all memory in background. Introduce a thread (or multiple)
>     that gets memory accepted proactively. It will minimize time the
>     system experience latency spikes on memory allocation while keeping
>     low boot time.
> 
>     This approach cannot function on its own. It is an extension of #2:
>     background memory acceptance requires functional scheduler, but the
>     page allocator may need to tap into unaccepted memory before that.
> 
>     The downside of the approach is that these threads also steal CPU
>     cycles and memory bandwidth from the user's workload and may hurt
>     user experience.
> 
> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3 and they can be implemented later based on user's
> demands.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>   - memblock has to accept memory on allocation;
> 
>   - page allocator has to accept memory on the first allocation of the
>     page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages. New memory gets accepted
> before putting pages on free lists. It is done lazily: only accept new
> pages when we run out of already accepted memory.
> 
> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>  - accept_memory() makes a range of physical addresses accepted.
> 
>  - range_contains_unaccepted_memory() checks anything within the range
>    of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock

Hi,

the intrusiveness to the page allocator seems to be much better in this
version than in the older ones, thanks!

> ---
>  include/linux/mmzone.h     |   5 ++
>  include/linux/page-flags.h |  24 ++++++++
>  mm/internal.h              |  12 ++++
>  mm/memblock.c              |   9 +++
>  mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 169 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5f74891556f3..da335381e63f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -822,6 +822,11 @@ struct zone {
>  	/* free areas of different sizes */
>  	struct free_area	free_area[MAX_ORDER];
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* pages to be accepted */
> +	struct list_head	unaccepted_pages;
> +#endif
> +
>  	/* zone flags, see below */
>  	unsigned long		flags;
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 0b0ae5084e60..ce953be8fe10 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -941,6 +941,7 @@ static inline bool is_page_hwpoison(struct page *page)
>  #define PG_offline	0x00000100
>  #define PG_table	0x00000200
>  #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
>  
>  #define PageType(page, flag)						\
>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -966,6 +967,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>  	page->page_type |= PG_##lname;					\
>  }
>  
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}

Wonder if we need this. The page type is defined regardless of
CONFIG_UNACCEPTED_MEMORY and with CONFIG_UNACCEPTED_MEMORY=n nobody will
actually be setting or checking the type, so we can simply keep the
"real" implementations?

> +
>  /*
>   * PageBuddy() indicates that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
> @@ -996,6 +1009,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>  extern void page_offline_freeze(void);
>  extern void page_offline_thaw(void);
>  extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b7ef495b56d..8ef4f88608ad 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -856,4 +856,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
>  	return !(vma->vm_flags & VM_SOFTDIRTY);
>  }
>  
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 511d4783dcf1..3bc404a5352a 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1423,6 +1423,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  		 */
>  		kmemleak_alloc_phys(found, size, 0);
>  
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>  	return found;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e60657875d3..6d597e833a73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +/* Counts number of zones with unaccepted pages. */
> +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> +#endif
> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  /*
>   * During boot we initialize deferred pages on-demand, as needed, but once
> @@ -1043,12 +1048,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>  {
>  	struct free_area *area = &zone->free_area[order];
>  
> +	VM_BUG_ON_PAGE(PageUnaccepted(page), page);

Hm I think we're not allowed to do VM_BUG_ON anymore per some semi-recent
thread with Linus, as new BUG_ONs are forbidden in general, and Fedora
builds with DEBUG_VM. So probably use VM_WARN_ON everywhere and bail out
where feasible (i.e. probably not here).
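
A sketch of the suggested downgrade, assuming the VM_WARN_ON_ONCE_PAGE()
helper from <linux/mmdebug.h>; this is not a line from the patch itself:

	/* Warn instead of BUG; there is no feasible way to bail out here. */
	VM_WARN_ON_ONCE_PAGE(PageUnaccepted(page), page);
	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);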

>  	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>  					   unsigned int order)
>  {
> +	VM_BUG_ON_PAGE(PageUnaccepted(page), page);
> +
>  	/* clear reported state and update reported page count */
>  	if (page_reported(page))
>  		__ClearPageReported(page);
> @@ -1728,6 +1736,97 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	__count_vm_events(PGFREE, 1 << order);
>  }
>  
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +}
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	unsigned long flags, order;
> +	struct page *page;
> +	bool last = false;
> +	int migratetype;
> +
> +	if (!static_branch_unlikely(&unaccepted_pages))
> +		return false;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	page = list_first_entry_or_null(&zone->unaccepted_pages,
> +					struct page, lru);
> +	if (!page) {
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +		return false;
> +	}
> +
> +	list_del(&page->lru);
> +	last = list_empty(&zone->unaccepted_pages);
> +
> +	order = page->private;
> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> +
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> +	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	if (last)
> +		static_branch_dec(&unaccepted_pages);
> +
> +	accept_page(page, order);
> +	__ClearPageUnaccepted(page);
> +	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +
> +	return true;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long flags;
> +	int migratetype;
> +	bool first = false;
> +
> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> +	__SetPageUnaccepted(page);
> +	page->private = order;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	first = list_empty(&zone->unaccepted_pages);
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> +	list_add_tail(&page->lru, &zone->unaccepted_pages);
> +	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	if (first)
> +		static_branch_inc(&unaccepted_pages);
> +}
> +
> +#else
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	return false;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	BUILD_BUG();
> +}
> +
> +#endif /* CONFIG_UNACCEPTED_MEMORY */
> +
>  void __free_pages_core(struct page *page, unsigned int order)
>  {
>  	unsigned int nr_pages = 1 << order;
> @@ -1750,6 +1849,13 @@ void __free_pages_core(struct page *page, unsigned int order)
>  
>  	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>  
> +	if (page_contains_unaccepted(page, order)) {
> +		if (order >= pageblock_order)
> +			return __free_unaccepted(page, order);

Would it be better to only do this when all pages are unaccepted, and accept
all of them as long as at least one in there is already accepted? Or not
worth the trouble of making page_contains_unaccepted() actually count all of
them, which would be necessary.
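
A sketch of what that counting would require; the helper name is
hypothetical and this is not code from the thread:

	/* True only if every PMD-sized chunk of the page is unaccepted. */
	static bool page_fully_unaccepted(struct page *page, unsigned int order)
	{
		phys_addr_t start = page_to_phys(page);
		phys_addr_t end = start + (PAGE_SIZE << order);
		phys_addr_t p;

		for (p = start; p < end; p += PMD_SIZE)
			if (!range_contains_unaccepted_memory(p, p + PMD_SIZE))
				return false;

		return true;
	}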

> +		else
> +			accept_page(page, order);
> +	}
> +
>  	/*
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
> @@ -1910,6 +2016,9 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>  
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if (pageblock_aligned(pfn))
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -4247,6 +4356,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  				       gfp_mask)) {
>  			int ret;
>  
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;


This is under !zone_watermark_fast(), but since unaccepted pages are counted
as free pages, I think this is unlikely to trigger in practice AFAIU, as we
should therefore be passing the watermarks even if pages are unaccepted.

That's (intentionally IIRC) different from deferred init which doesn't count
the pages as free until they are initialized.
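
Pulling the relevant line from the __free_unaccepted() hunk above to make
the observation concrete:

	/* __free_unaccepted() accounts the pages as free... */
	__mod_zone_freepage_state(zone, 1 << order, migratetype);

	/*
	 * ...so NR_FREE_PAGES includes unaccepted pages, the watermark
	 * checks tend to pass, and the try_to_accept_memory() call above
	 * is mostly unreached.
	 */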

>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/*
>  			 * Watermark failed for this zone, but see if we can
> @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  
>  			return page;
>  		} else {
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;

On the other hand, here we failed the full rmqueue(), including the
potentially fragmenting fallbacks, so I'm worried that before we finally
fail all of that and resort to accepting more memory, we already fragmented
the already accepted memory, more than necessary.

So one way to prevent would be to move the acceptance into rmqueue() to
happen before __rmqueue_fallback(), which I originally had in mind and maybe
suggested that previously.

But maybe a less intrusive and more robust way would be to track how much
memory is unaccepted, and actually decrement that amount from free memory
in zone_watermark_fast() in order to force earlier failure of that check and
thus to accept more memory and give us a buffer of truly accepted and
available memory up to the high watermark, which should hopefully prevent
most of the fallbacks. Then the code I flagged above as currently
unnecessary would make perfect sense.

And maybe Mel will have some ideas as well.

> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/* Try again if zone has deferred pages */
>  			if (static_branch_unlikely(&deferred_pages)) {
> @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>  		zone->free_area[order].nr_free = 0;
>  	}
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	INIT_LIST_HEAD(&zone->unaccepted_pages);
> +#endif
>  }
>  
>  /*


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-09 18:10   ` Vlastimil Babka
@ 2022-12-09 19:26     ` Kirill A. Shutemov
  2022-12-09 22:23       ` Vlastimil Babka
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-09 19:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Fri, Dec 09, 2022 at 07:10:25PM +0100, Vlastimil Babka wrote:
> On 12/7/22 02:49, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, require memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> > 
> > There are several ways kernel can deal with unaccepted memory:
> > 
> >  1. Accept all the memory during the boot. It is easy to implement and
> >     it doesn't have runtime cost once the system is booted. The downside
> >     is very long boot time.
> > 
> >     Accept can be parallelized to multiple CPUs to keep it manageable
> >     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
> >     memory bandwidth and does not scale beyond the point.
> > 
> >  2. Accept a block of memory on the first use. It requires more
> >     infrastructure and changes in page allocator to make it work, but
> >     it provides good boot time.
> > 
> >     On-demand memory accept means latency spikes every time kernel steps
> >     onto a new memory block. The spikes will go away once workload data
> >     set size gets stabilized or all memory gets accepted.
> > 
> >  3. Accept all memory in background. Introduce a thread (or multiple)
> >     that gets memory accepted proactively. It will minimize time the
> >     system experience latency spikes on memory allocation while keeping
> >     low boot time.
> > 
> >     This approach cannot function on its own. It is an extension of #2:
> >     background memory acceptance requires functional scheduler, but the
> >     page allocator may need to tap into unaccepted memory before that.
> > 
> >     The downside of the approach is that these threads also steal CPU
> >     cycles and memory bandwidth from the user's workload and may hurt
> >     user experience.
> > 
> > Implement #2 for now. It is a reasonable default. Some workloads may
> > want to use #1 or #3 and they can be implemented later based on user's
> > demands.
> > 
> > Support of unaccepted memory requires a few changes in core-mm code:
> > 
> >   - memblock has to accept memory on allocation;
> > 
> >   - page allocator has to accept memory on the first allocation of the
> >     page;
> > 
> > Memblock change is trivial.
> > 
> > The page allocator is modified to accept pages. New memory gets accepted
> > before putting pages on free lists. It is done lazily: only accept new
> > pages when we run out of already accepted memory.
> > 
> > Architecture has to provide two helpers if it wants to support
> > unaccepted memory:
> > 
> >  - accept_memory() makes a range of physical addresses accepted.
> > 
> >  - range_contains_unaccepted_memory() checks anything within the range
> >    of physical addresses requires acceptance.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> 
> Hi,
> 
> the intrusiveness to the page allocator seems to be much better in this
> version than in the older ones, thanks!
> 
> > ---
> >  include/linux/mmzone.h     |   5 ++
> >  include/linux/page-flags.h |  24 ++++++++
> >  mm/internal.h              |  12 ++++
> >  mm/memblock.c              |   9 +++
> >  mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
> >  5 files changed, 169 insertions(+)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 5f74891556f3..da335381e63f 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -822,6 +822,11 @@ struct zone {
> >  	/* free areas of different sizes */
> >  	struct free_area	free_area[MAX_ORDER];
> >  
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +	/* pages to be accepted */
> > +	struct list_head	unaccepted_pages;
> > +#endif
> > +
> >  	/* zone flags, see below */
> >  	unsigned long		flags;
> >  
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 0b0ae5084e60..ce953be8fe10 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -941,6 +941,7 @@ static inline bool is_page_hwpoison(struct page *page)
> >  #define PG_offline	0x00000100
> >  #define PG_table	0x00000200
> >  #define PG_guard	0x00000400
> > +#define PG_unaccepted	0x00000800
> >  
> >  #define PageType(page, flag)						\
> >  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> > @@ -966,6 +967,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
> >  	page->page_type |= PG_##lname;					\
> >  }
> >  
> > +#define PAGE_TYPE_OPS_FALSE(uname)					\
> > +static __always_inline int Page##uname(struct page *page)		\
> > +{									\
> > +	return false;							\
> > +}									\
> > +static __always_inline void __SetPage##uname(struct page *page)		\
> > +{									\
> > +}									\
> > +static __always_inline void __ClearPage##uname(struct page *page)	\
> > +{									\
> > +}
> 
> Wonder if we need this. The page type is defined regardless of
> CONFIG_UNACCEPTED_MEMORY and with CONFIG_UNACCEPTED_MEMORY=n nobody will
> actually be setting or checking the type, so we can simply keep the
> "real" implementations?

Okay. Makes sense.

> > +
> >  /*
> >   * PageBuddy() indicates that the page is free and in the buddy system
> >   * (see mm/page_alloc.c).
> > @@ -996,6 +1009,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
> >   */
> >  PAGE_TYPE_OPS(Offline, offline)
> >  
> > +/*
> > + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> > + * be read or written. The page allocator must call accept_page() before
> > + * touching the page or returning it to the caller.
> > + */
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> > +#else
> > +PAGE_TYPE_OPS_FALSE(Unaccepted)
> > +#endif
> > +
> >  extern void page_offline_freeze(void);
> >  extern void page_offline_thaw(void);
> >  extern void page_offline_begin(void);
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 6b7ef495b56d..8ef4f88608ad 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -856,4 +856,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
> >  	return !(vma->vm_flags & VM_SOFTDIRTY);
> >  }
> >  
> > +#ifndef CONFIG_UNACCEPTED_MEMORY
> > +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> > +						    phys_addr_t end)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +}
> > +#endif
> > +
> >  #endif	/* __MM_INTERNAL_H */
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 511d4783dcf1..3bc404a5352a 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1423,6 +1423,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >  		 */
> >  		kmemleak_alloc_phys(found, size, 0);
> >  
> > +	/*
> > +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > +	 * require memory to be accepted before it can be used by the
> > +	 * guest.
> > +	 *
> > +	 * Accept the memory of the allocated buffer.
> > +	 */
> > +	accept_memory(found, found + size);
> > +
> >  	return found;
> >  }
> >  
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e60657875d3..6d597e833a73 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
> >  
> >  int page_group_by_mobility_disabled __read_mostly;
> >  
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +/* Counts number of zones with unaccepted pages. */
> > +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> > +#endif
> > +
> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  /*
> >   * During boot we initialize deferred pages on-demand, as needed, but once
> > @@ -1043,12 +1048,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
> >  {
> >  	struct free_area *area = &zone->free_area[order];
> >  
> > +	VM_BUG_ON_PAGE(PageUnaccepted(page), page);
> 
> Hm I think we're not allowed to do VM_BUG_ON anymore per some semi-recent
> thread with Linus, as new BUG_ONs are forbidden in general, and Fedora
> builds with DEBUG_VM. So probably use VM_WARN_ON everywhere and bail out
> where feasible (i.e. probably not here).

Okay, I can downgrade it to VM_WARN_ON_PAGE(), but it will crash soon
after anyway, on the first access to the supposedly free page that is
unaccepted.

> >  	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
> >  }
> >  
> >  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> >  					   unsigned int order)
> >  {
> > +	VM_BUG_ON_PAGE(PageUnaccepted(page), page);
> > +
> >  	/* clear reported state and update reported page count */
> >  	if (page_reported(page))
> >  		__ClearPageReported(page);
> > @@ -1728,6 +1736,97 @@ static void __free_pages_ok(struct page *page, unsigned int order,
> >  	__count_vm_events(PGFREE, 1 << order);
> >  }
> >  
> > +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +	phys_addr_t end = start + (PAGE_SIZE << order);
> > +
> > +	return range_contains_unaccepted_memory(start, end);
> > +}
> > +
> > +static void accept_page(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +
> > +	accept_memory(start, start + (PAGE_SIZE << order));
> > +}
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +
> > +static bool try_to_accept_memory(struct zone *zone)
> > +{
> > +	unsigned long flags, order;
> > +	struct page *page;
> > +	bool last = false;
> > +	int migratetype;
> > +
> > +	if (!static_branch_unlikely(&unaccepted_pages))
> > +		return false;
> > +
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +	page = list_first_entry_or_null(&zone->unaccepted_pages,
> > +					struct page, lru);
> > +	if (!page) {
> > +		spin_unlock_irqrestore(&zone->lock, flags);
> > +		return false;
> > +	}
> > +
> > +	list_del(&page->lru);
> > +	last = list_empty(&zone->unaccepted_pages);
> > +
> > +	order = page->private;
> > +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> > +
> > +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> > +	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > +	if (last)
> > +		static_branch_dec(&unaccepted_pages);
> > +
> > +	accept_page(page, order);
> > +	__ClearPageUnaccepted(page);
> > +	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> > +
> > +	return true;
> > +}
> > +
> > +static void __free_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +	unsigned long flags;
> > +	int migratetype;
> > +	bool first = false;
> > +
> > +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> > +	__SetPageUnaccepted(page);
> > +	page->private = order;
> > +
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +	first = list_empty(&zone->unaccepted_pages);
> > +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> > +	list_add_tail(&page->lru, &zone->unaccepted_pages);
> > +	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > +	if (first)
> > +		static_branch_inc(&unaccepted_pages);
> > +}
> > +
> > +#else
> > +
> > +static bool try_to_accept_memory(struct zone *zone)
> > +{
> > +	return false;
> > +}
> > +
> > +static void __free_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	BUILD_BUG();
> > +}
> > +
> > +#endif /* CONFIG_UNACCEPTED_MEMORY */
> > +
> >  void __free_pages_core(struct page *page, unsigned int order)
> >  {
> >  	unsigned int nr_pages = 1 << order;
> > @@ -1750,6 +1849,13 @@ void __free_pages_core(struct page *page, unsigned int order)
> >  
> >  	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> >  
> > +	if (page_contains_unaccepted(page, order)) {
> > +		if (order >= pageblock_order)
> > +			return __free_unaccepted(page, order);
> 
> Would it be better to only do this when all pages are unaccepted, and accept
> all of them as long as at least one in there is already accepted? Or not
> worth the trouble of making page_contains_unaccepted() actually count all of
> them, which would be necessary.

I don't think it's worth the change.

This is going to be a corner case that affects a few pages that happen to
be on the border between unaccepted and accepted memory. It makes no
difference in practice.

> > +		else
> > +			accept_page(page, order);
> > +	}
> > +
> >  	/*
> >  	 * Bypass PCP and place fresh pages right to the tail, primarily
> >  	 * relevant for memory onlining.
> > @@ -1910,6 +2016,9 @@ static void __init deferred_free_range(unsigned long pfn,
> >  		return;
> >  	}
> >  
> > +	/* Accept chunks smaller than page-block upfront */
> > +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> > +
> >  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> >  		if (pageblock_aligned(pfn))
> >  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > @@ -4247,6 +4356,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  				       gfp_mask)) {
> >  			int ret;
> >  
> > +			if (try_to_accept_memory(zone))
> > +				goto try_this_zone;
> 
> 
> This is under !zone_watermark_fast(), but since unaccepted pages are counted
> as free pages, I think this is unlikely to trigger in practice AFAIU, as we
> should therefore be passing the watermarks even if pages are unaccepted.
> 
> That's (intentionally IIRC) different from deferred init which doesn't count
> the pages as free until they are initialized.

It is intentional in the sense that we don't want to accept pages
proactively. Deferred init needs to be finished before userspace starts.

> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  			/*
> >  			 * Watermark failed for this zone, but see if we can
> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  
> >  			return page;
> >  		} else {
> > +			if (try_to_accept_memory(zone))
> > +				goto try_this_zone;
> 
> On the other hand, here we failed the full rmqueue(), including the
> potentially fragmenting fallbacks, so I'm worried that before we finally
> fail all of that and resort to accepting more memory, we already fragmented
> the already accepted memory, more than necessary.

I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
allocate from a free pageblock if we have other memory to tap from? It
doesn't make sense to me.

> So one way to prevent would be to move the acceptance into rmqueue() to
> happen before __rmqueue_fallback(), which I originally had in mind and maybe
> suggested that previously.

I guess it should be pretty straightforward to fail __rmqueue_fallback()
if there's a non-empty unaccepted_pages list and steer to
try_to_accept_memory() this way.

But I still don't understand why.

> But maybe a less intrusive and more robust way would be to track how much
> memory is unaccepted, and actually decrement that amount from free memory
> in zone_watermark_fast() in order to force earlier failure of that check and
> thus to accept more memory and give us a buffer of truly accepted and
> available memory up to the high watermark, which should hopefully prevent
> most of the fallbacks. Then the code I flagged above as currently
> unnecessary would make perfect sense.

The next patch adds per-node unaccepted memory accounting. We can move it
per-zone if it would help.

> And maybe Mel will have some ideas as well.

I don't have much expertise in the page allocator. Any input is valuable.

> > +
> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  			/* Try again if zone has deferred pages */
> >  			if (static_branch_unlikely(&deferred_pages)) {
> > @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> >  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
> >  		zone->free_area[order].nr_free = 0;
> >  	}
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +	INIT_LIST_HEAD(&zone->unaccepted_pages);
> > +#endif
> >  }
> >  
> >  /*
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-09 19:26     ` Kirill A. Shutemov
@ 2022-12-09 22:23       ` Vlastimil Babka
  2022-12-24 16:46         ` Kirill A. Shutemov
  0 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2022-12-09 22:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 12/9/22 20:26, Kirill A. Shutemov wrote:
>> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> >  			/*
>> >  			 * Watermark failed for this zone, but see if we can
>> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>> >  
>> >  			return page;
>> >  		} else {
>> > +			if (try_to_accept_memory(zone))
>> > +				goto try_this_zone;
>> 
>> On the other hand, here we failed the full rmqueue(), including the
>> potentially fragmenting fallbacks, so I'm worried that before we finally
>> fail all of that and resort to accepting more memory, we already fragmented
>> the already accepted memory, more than necessary.
> 
> I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
> allocate from a free pageblock if we have other memory to tap from? It
> doesn't make sense to me.

The fragmentation avoidance based on migratetype does work with pageblock
granularity, so yeah, if you accept a single pageblock worth of memory and
then (through __rmqueue_fallback()) end up serving both movable and
unmovable allocations from it, the whole fragmentation avoidance mechanism
is defeated and you end up with unmovable allocations (e.g. page tables)
scattered over many pageblocks and inability to allocate any huge pages.

>> So one way to prevent would be to move the acceptance into rmqueue() to
>> happen before __rmqueue_fallback(), which I originally had in mind and maybe
>> suggested that previously.
> 
> I guess it should be pretty straightforward to fail __rmqueue_fallback()
> if there's a non-empty unaccepted_pages list and steer to
> try_to_accept_memory() this way.

That could be a way indeed. We do have ALLOC_NOFRAGMENT which could
possibly be employed here.
But maybe the zone_watermark_fast() modification would be simpler yet
sufficient. It makes sense to me that we'd try to keep a high watermark
worth of pre-accepted memory. zone_watermark_fast() would fail at the low
watermark, so we could try accepting (high-low) at a time instead of a
single pageblock.
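
A hedged sketch of that idea; the helper accept_up_to_high_wmark() is
hypothetical and untested, built only from functions visible in this
thread:

	/* Accept roughly (high - low) watermark worth of memory at once. */
	static void accept_up_to_high_wmark(struct zone *zone)
	{
		long to_accept = high_wmark_pages(zone) - low_wmark_pages(zone);

		while (to_accept > 0) {
			if (!try_to_accept_memory(zone))
				break;	/* nothing left to accept */
			to_accept -= pageblock_nr_pages;
		}
	}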

> But I still don't understand why.

To avoid what I described above.

>> But maybe a less intrusive and more robust way would be to track how much
>> memory is unaccepted, and actually decrement that amount from free memory
>> in zone_watermark_fast() in order to force earlier failure of that check and
>> thus to accept more memory and give us a buffer of truly accepted and
>> available memory up to the high watermark, which should hopefully prevent
>> most of the fallbacks. Then the code I flagged above as currently
>> unnecessary would make perfect sense.
> 
> The next patch adds per-node unaccepted memory accounting. We can move it
> per-zone if it would help.

Right.

>> And maybe Mel will have some ideas as well.
> 
> I don't have much expertise in the page allocator. Any input is valuable.
> 
>> > +
>> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> >  			/* Try again if zone has deferred pages */
>> >  			if (static_branch_unlikely(&deferred_pages)) {
>> > @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>> >  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>> >  		zone->free_area[order].nr_free = 0;
>> >  	}
>> > +
>> > +#ifdef CONFIG_UNACCEPTED_MEMORY
>> > +	INIT_LIST_HEAD(&zone->unaccepted_pages);
>> > +#endif
>> >  }
>> >  
>> >  /*
>> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-09 22:23       ` Vlastimil Babka
@ 2022-12-24 16:46         ` Kirill A. Shutemov
  2023-01-12 11:59           ` Vlastimil Babka
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-24 16:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Fri, Dec 09, 2022 at 11:23:50PM +0100, Vlastimil Babka wrote:
> On 12/9/22 20:26, Kirill A. Shutemov wrote:
> >> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >> >  			/*
> >> >  			 * Watermark failed for this zone, but see if we can
> >> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >> >  
> >> >  			return page;
> >> >  		} else {
> >> > +			if (try_to_accept_memory(zone))
> >> > +				goto try_this_zone;
> >> 
> >> On the other hand, here we failed the full rmqueue(), including the
> >> potentially fragmenting fallbacks, so I'm worried that before we finally
> >> fail all of that and resort to accepting more memory, we already fragmented
> >> the already accepted memory, more than necessary.
> > 
> > I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
> > allocate from a free pageblock if we have other memory to tap from? It
> > doesn't make sense to me.
> 
> The fragmentation avoidance based on migratetype does work with pageblock
> granularity, so yeah, if you accept a single pageblock worth of memory and
> then (through __rmqueue_fallback()) end up serving both movable and
> unmovable allocations from it, the whole fragmentation avoidance mechanism
> is defeated and you end up with unmovable allocations (e.g. page tables)
> scattered over many pageblocks and inability to allocate any huge pages.
> 
> >> So one way to prevent would be to move the acceptance into rmqueue() to
> >> happen before __rmqueue_fallback(), which I originally had in mind and maybe
> >> suggested that previously.
> > 
> > I guess it should be pretty straightforward to fail __rmqueue_fallback()
> > if there's a non-empty unaccepted_pages list and steer to
> > try_to_accept_memory() this way.
> 
> That could be a way indeed. We do have ALLOC_NOFRAGMENT which could
> possibly be employed here.
> But maybe the zone_watermark_fast() modification would be simpler yet
> sufficient. It makes sense to me that we'd try to keep a high watermark
> worth of pre-accepted memory. zone_watermark_fast() would fail at the low
> watermark, so we could try accepting (high-low) at a time instead of a
> single pageblock.

Looks like we already have __zone_watermark_unusable_free(), which seems to
match the use-case rather closely. We only need to switch unaccepted memory
to per-zone accounting.

The fixup below is supposed to do the trick, but I'm not sure how to test
fragmentation avoidance properly.

Any suggestions?

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ca6f0590be21..1bd2d245edee 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -483,7 +483,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 #ifdef CONFIG_UNACCEPTED_MEMORY
 			     ,
-			     nid, K(node_page_state(pgdat, NR_UNACCEPTED))
+			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
 #endif
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 789b77c7b6df..e9c05b4c457c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -157,7 +157,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	show_val_kb(m, "Unaccepted:     ",
-		    global_node_page_state(NR_UNACCEPTED));
+		    global_zone_page_state(NR_UNACCEPTED));
 #endif
 
 	hugetlb_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9c762e8175fc..8b5800cd4424 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -152,6 +152,9 @@ enum zone_stat_item {
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
 	NR_FREE_CMA_PAGES,
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	NR_UNACCEPTED,
+#endif
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
@@ -198,9 +201,6 @@ enum node_stat_item {
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
 	NR_KERNEL_STACK_KB,	/* measured in KiB */
-#ifdef CONFIG_UNACCEPTED_MEMORY
-	NR_UNACCEPTED,
-#endif
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e80e8d398863..404b267332a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1779,7 +1779,7 @@ static bool try_to_accept_memory(struct zone *zone)
 
 	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
 	__mod_zone_freepage_state(zone, -1 << order, migratetype);
-	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, -1 << order);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, -1 << order);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	if (last)
@@ -1808,7 +1808,7 @@ static void __free_unaccepted(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
 	list_add_tail(&page->lru, &zone->unaccepted_pages);
 	__mod_zone_freepage_state(zone, 1 << order, migratetype);
-	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, 1 << order);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, 1 << order);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	if (first)
@@ -4074,6 +4074,9 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	if (!(alloc_flags & ALLOC_CMA))
 		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	unusable_free += zone_page_state(z, NR_UNACCEPTED);
+#endif
 
 	return unusable_free;
 }
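
A conceptual sketch of what the fixup changes, not code from the kernel
tree; the names match the hunks above:

	/* In the watermark check, after the fixup: */
	long free_pages = zone_page_state(z, NR_FREE_PAGES);
	long unusable = __zone_watermark_unusable_free(z, order, alloc_flags);

	/*
	 * unusable now includes NR_UNACCEPTED, so the zone passes the
	 * watermark only on truly accepted free pages and acceptance is
	 * forced before the fragmenting fallbacks are reached.
	 */
	if (free_pages - unusable <= mark + z->lowmem_reserve[highest_zoneidx])
		return false;	/* fail fast, accept more memory */
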
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-07  1:49 ` [PATCHv8 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
  2022-12-09 18:10   ` Vlastimil Babka
@ 2022-12-26 12:23   ` Borislav Petkov
  2022-12-27  3:18     ` Kirill A. Shutemov
  2023-01-16 13:04   ` Mel Gorman
  2 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2022-12-26 12:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Wed, Dec 07, 2022 at 04:49:21AM +0300, Kirill A. Shutemov wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e60657875d3..6d597e833a73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +/* Counts number of zones with unaccepted pages. */
> +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);

Then call it what it is:

static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);

so that when reading the code it is self-describing:

        if (!static_branch_unlikely(&zones_with_unaccepted_pages))
                return false;

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-26 12:23   ` Borislav Petkov
@ 2022-12-27  3:18     ` Kirill A. Shutemov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2022-12-27  3:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Mon, Dec 26, 2022 at 01:23:18PM +0100, Borislav Petkov wrote:
> On Wed, Dec 07, 2022 at 04:49:21AM +0300, Kirill A. Shutemov wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e60657875d3..6d597e833a73 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
> >  
> >  int page_group_by_mobility_disabled __read_mostly;
> >  
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +/* Counts number of zones with unaccepted pages. */
> > +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> 
> Then call it what it is:
> 
> static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);
> 
> so that when reading the code it is self-describing:
> 
>         if (!static_branch_unlikely(&zones_with_unaccepted_pages))
>                 return false;

Looks good. Will fix for the next version.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 05/14] x86/boot: Add infrastructure required for unaccepted memory support
  2022-12-07  1:49 ` [PATCHv8 05/14] x86/boot: Add infrastructure required for unaccepted memory support Kirill A. Shutemov
@ 2023-01-03 13:52   ` Borislav Petkov
  0 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2023-01-03 13:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Dec 07, 2022 at 04:49:24AM +0300, Kirill A. Shutemov wrote:
> Pull functionality from the main kernel headers and lib/ that is
> required for unaccepted memory support.
> 
> This is preparatory patch. The users for the functionality will come in
> following patches.

No need for that sentence.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/boot/bitops.h                   | 40 ++++++++++++
>  arch/x86/boot/compressed/align.h         | 14 +++++
>  arch/x86/boot/compressed/bitmap.c        | 43 +++++++++++++
>  arch/x86/boot/compressed/bitmap.h        | 49 +++++++++++++++
>  arch/x86/boot/compressed/bits.h          | 36 +++++++++++
>  arch/x86/boot/compressed/find.c          | 54 ++++++++++++++++
>  arch/x86/boot/compressed/find.h          | 79 ++++++++++++++++++++++++
>  arch/x86/boot/compressed/math.h          | 37 +++++++++++
>  arch/x86/boot/compressed/minmax.h        | 61 ++++++++++++++++++
>  arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++
>  10 files changed, 438 insertions(+)
>  create mode 100644 arch/x86/boot/compressed/align.h
>  create mode 100644 arch/x86/boot/compressed/bitmap.c
>  create mode 100644 arch/x86/boot/compressed/bitmap.h
>  create mode 100644 arch/x86/boot/compressed/bits.h
>  create mode 100644 arch/x86/boot/compressed/find.c
>  create mode 100644 arch/x86/boot/compressed/find.h
>  create mode 100644 arch/x86/boot/compressed/math.h
>  create mode 100644 arch/x86/boot/compressed/minmax.h
>  create mode 100644 arch/x86/boot/compressed/pgtable_types.h

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 06/14] efi/x86: Implement support for unaccepted memory
  2022-12-07  1:49 ` [PATCHv8 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2023-01-03 14:20   ` Borislav Petkov
  2023-03-25  0:51     ` Kirill A. Shutemov
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2023-01-03 14:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, Dec 07, 2022 at 04:49:25AM +0300, Kirill A. Shutemov wrote:
> The implementation requires some basic helpers in boot stub. They
> provided by linux/ includes in the main kernel image, but is not present
> in boot stub. Create copy of required functionality in the boot stub.

Leftover paragraph from a previous version. Can be removed.

...

> +/*
> + * The accepted memory bitmap only works at PMD_SIZE granularity.  This
> + * function takes unaligned start/end addresses and either:

s/This function takes/Take/

> + *  1. Accepts the memory immediately and in its entirety
> + *  2. Accepts unaligned parts, and marks *some* aligned part unaccepted
> + *
> + * The function will never reach the bitmap_set() with zero bits to set.
> + */
> +void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
> +{
> +	/*
> +	 * Ensure that at least one bit will be set in the bitmap by
> +	 * immediately accepting all regions under 2*PMD_SIZE.  This is
> +	 * imprecise and may immediately accept some areas that could
> +	 * have been represented in the bitmap.  But, results in simpler
> +	 * code below
> +	 *
> +	 * Consider case like this:
> +	 *
> +	 * | 4k | 2044k |    2048k   |
> +	 * ^ 0x0        ^ 2MB        ^ 4MB
> +	 *
> +	 * Only the first 4k has been accepted. The 0MB->2MB region can not be
> +	 * represented in the bitmap. The 2MB->4MB region can be represented in
> +	 * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
> +	 * immediately accepted in its entirety.
> +	 */
> +	if (end - start < 2 * PMD_SIZE) {
> +		__accept_memory(start, end);
> +		return;
> +	}
> +
> +	/*
> +	 * No matter how the start and end are aligned, at least one unaccepted
> +	 * PMD_SIZE area will remain to be marked in the bitmap.
> +	 */
> +
> +	/* Immediately accept a <PMD_SIZE piece at the start: */
> +	if (start & ~PMD_MASK) {
> +		__accept_memory(start, round_up(start, PMD_SIZE));
> +		start = round_up(start, PMD_SIZE);
> +	}
> +
> +	/* Immediately accept a <PMD_SIZE piece at the end: */
> +	if (end & ~PMD_MASK) {
> +		__accept_memory(round_down(end, PMD_SIZE), end);
> +		end = round_down(end, PMD_SIZE);
> +	}
> +
> +	/*
> +	 * 'start' and 'end' are now both PMD-aligned.
> +	 * Record the range as being unaccepted:
> +	 */
> +	bitmap_set((unsigned long *)params->unaccepted_memory,
> +		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> +}

...

> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 6787ed8dfacf..8aa8adf0bcb5 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -314,6 +314,20 @@ config EFI_COCO_SECRET
>  	  virt/coco/efi_secret module to access the secrets, which in turn
>  	  allows userspace programs to access the injected secrets.
>  
> +config UNACCEPTED_MEMORY
> +	bool
> +	depends on EFI_STUB

This still doesn't make a whole lotta sense. If I do "make menuconfig" I don't
see the help text because that bool doesn't have a string prompt. So who is that
help text for?

Then, in the last patch you have

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -888,6 +888,8 @@ config INTEL_TDX_GUEST
        select ARCH_HAS_CC_PLATFORM
        select X86_MEM_ENCRYPT
        select X86_MCE
+       select UNACCEPTED_MEMORY
+       select EFI_STUB

I guess you want to select UNACCEPTED_MEMORY only.

And I've already mentioned this whole mess:

https://lore.kernel.org/r/Yt%2BnOeLMqRxjObbx@zn.tnic

Please incorporate all review comments before sending a new version of
your patch.

Ignoring review feedback is a very unfriendly thing to do:

- if you agree with the feedback, you work it in in the next revision

- if you don't agree, you *say* *why* you don't

> +	help
> +	   Some Virtual Machine platforms, such as Intel TDX, require
> +	   some memory to be "accepted" by the guest before it can be used.
> +	   This mechanism helps prevent malicious hosts from making changes
> +	   to guest memory.
> +
> +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> +	   This option adds support for unaccepted memory and makes such memory
> +	   usable by the kernel.

...

> +static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
> +					       __u32 nr_desc,
> +					       struct efi_boot_memmap *map)
> +{
> +	unsigned long *mem = NULL;
> +	u64 size, max_addr = 0;
> +	efi_status_t status;
> +	bool found = false;
> +	int i;
> +
> +	/* Check if there's any unaccepted memory and find the max address */
> +	for (i = 0; i < nr_desc; i++) {
> +		efi_memory_desc_t *d;
> +		unsigned long m = (unsigned long)map->map;
> +
> +		d = efi_early_memdesc_ptr(m, map->desc_size, i);
> +		if (d->type == EFI_UNACCEPTED_MEMORY)
> +			found = true;
> +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> +	}
> +
> +	if (!found) {
> +		params->unaccepted_memory = 0;
> +		return EFI_SUCCESS;
> +	}
> +
> +	/*
> +	 * If unaccepted memory is present, allocate a bitmap to track what
> +	 * memory has to be accepted before access.
> +	 *
> +	 * One bit in the bitmap represents 2MiB in the address space:
> +	 * A 4k bitmap can track 64GiB of physical address space.
> +	 *
> +	 * In the worst case scenario -- a huge hole in the middle of the
> +	 * address space -- it needs 256MiB to handle 4PiB of the address
> +	 * space.
> +	 *
> +	 * TODO: handle situation if params->unaccepted_memory is already set.
> +	 * It's required to deal with kexec.

A TODO in a patch basically says this patch is not ready to go anywhere.

IOW, you need to handle that kexec case here gracefully. Even if you refuse to
boot a kexec-ed kernel because it cannot support handing in the bitmap from the
first kernel, yadda yadda...
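
Something along these lines would do for now (completely untested sketch,
pick whatever error code fits best):

	/*
	 * Sketch: instead of the TODO above, refuse to set up a second
	 * bitmap if the first kernel already handed one in.
	 */
	if (params->unaccepted_memory) {
		efi_err("Unaccepted memory bitmap already set up (kexec?), bailing out\n");
		return EFI_UNSUPPORTED;
	}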

> +	 *
> +	 * The bitmap will be populated in setup_e820() according to the memory
> +	 * map after efi_exit_boot_services().
> +	 */
> +	size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> +	status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
> +	if (status == EFI_SUCCESS) {
> +		memset(mem, 0, size);
> +		params->unaccepted_memory = (unsigned long)mem;
> +	}
> +
> +	return status;
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-24 16:46         ` Kirill A. Shutemov
@ 2023-01-12 11:59           ` Vlastimil Babka
  0 siblings, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2023-01-12 11:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On 12/24/22 17:46, Kirill A. Shutemov wrote:
> On Fri, Dec 09, 2022 at 11:23:50PM +0100, Vlastimil Babka wrote:
>> On 12/9/22 20:26, Kirill A. Shutemov wrote:
>> >> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> >> >  			/*
>> >> >  			 * Watermark failed for this zone, but see if we can
>> >> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>> >> >  
>> >> >  			return page;
>> >> >  		} else {
>> >> > +			if (try_to_accept_memory(zone))
>> >> > +				goto try_this_zone;
>> >> 
>> >> On the other hand, here we failed the full rmqueue(), including the
>> >> potentially fragmenting fallbacks, so I'm worried that before we finally
>> >> fail all of that and resort to accepting more memory, we already fragmented
>> >> the already accepted memory, more than necessary.
>> > 
>> > I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
>> > allocate from a free pageblock if we have other memory to tap from? It
>> > doesn't make sense to me.
>> 
>> The fragmentation avoidance based on migratetype does work with pageblock
>> granularity, so yeah, if you accept a single pageblock worth of memory and
>> then (through __rmqueue_fallback()) end up serving both movable and
>> unmovable allocations from it, the whole fragmentation avoidance mechanism
>> is defeated and you end up with unmovable allocations (e.g. page tables)
>> scattered over many pageblocks and inability to allocate any huge pages.
>> 
>> >> So one way to prevent would be to move the acceptance into rmqueue() to
>> >> happen before __rmqueue_fallback(), which I originally had in mind and maybe
>> >> suggested that previously.
>> > 
>> > I guess it should be pretty straightforward to fail __rmqueue_fallback()
>> > if there's non-empty unaccepted_pages list and steer to
>> > try_to_accept_memory() this way.
>> 
>> That could be a way indeed. We do have ALLOC_NOFRAGMENT which could be
>> possible to employ here.
>> But maybe the zone_watermark_fast() modification would be simpler yet
>> sufficient. It makes sense to me that we'd try to keep a high watermark
>> worth of pre-accepted memory. zone_watermark_fast() would fail at low
>> watermark, so we could try accepting (high-low) at a time instead of single
>> pageblock.
> 
> Looks like we already have __zone_watermark_unusable_free(), which seems
> to match the use case rather closely. We only need to switch unaccepted
> memory to per-zone accounting.

Could work. I'd still suggest also making try_to_accept_memory() accept up
to the high watermark, not a single pageblock.
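
A minimal sketch of what I mean (untested; try_to_accept_memory_one() is a
made-up helper that accepts and frees one block off zone->unaccepted_pages):

	static bool try_to_accept_memory(struct zone *zone)
	{
		long to_accept;
		bool ret = false;

		/* Refill up to the high watermark, not just one block. */
		to_accept = high_wmark_pages(zone) -
			    (zone_page_state(zone, NR_FREE_PAGES) -
			     __zone_watermark_unusable_free(zone, 0, 0));

		while (to_accept > 0) {
			if (!try_to_accept_memory_one(zone))
				break;
			ret = true;
			to_accept -= MAX_ORDER_NR_PAGES;
		}

		return ret;
	}
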
> The fixup below is supposed to do the trick, but I'm not sure how to test
> fragmentation avoidance properly.
> 
> Any suggestions?

Haven't done that for years, maybe Mel knows better. But from what I
remember, I'd compare /proc/pagetypeinfo with and without memory acceptance,
and collect the mm_page_alloc_extfrag tracepoint. If more of these events
happen, it's bad. Ideally test with a workload that stresses both userspace
(movable) allocations and kernel allocations. Again, Mel might have
suggestions for an mmtests configuration?

> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ca6f0590be21..1bd2d245edee 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -483,7 +483,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  #endif
>  #ifdef CONFIG_UNACCEPTED_MEMORY
>  			     ,
> -			     nid, K(node_page_state(pgdat, NR_UNACCEPTED))
> +			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
>  #endif
>  			    );
>  	len += hugetlb_report_node_meminfo(buf, len, nid);
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 789b77c7b6df..e9c05b4c457c 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -157,7 +157,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  
>  #ifdef CONFIG_UNACCEPTED_MEMORY
>  	show_val_kb(m, "Unaccepted:     ",
> -		    global_node_page_state(NR_UNACCEPTED));
> +		    global_zone_page_state(NR_UNACCEPTED));
>  #endif
>  
>  	hugetlb_report_meminfo(m);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9c762e8175fc..8b5800cd4424 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -152,6 +152,9 @@ enum zone_stat_item {
>  	NR_ZSPAGES,		/* allocated in zsmalloc */
>  #endif
>  	NR_FREE_CMA_PAGES,
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	NR_UNACCEPTED,
> +#endif
>  	NR_VM_ZONE_STAT_ITEMS };
>  
>  enum node_stat_item {
> @@ -198,9 +201,6 @@ enum node_stat_item {
>  	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
>  	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
>  	NR_KERNEL_STACK_KB,	/* measured in KiB */
> -#ifdef CONFIG_UNACCEPTED_MEMORY
> -	NR_UNACCEPTED,
> -#endif
>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>  	NR_KERNEL_SCS_KB,	/* measured in KiB */
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e80e8d398863..404b267332a9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1779,7 +1779,7 @@ static bool try_to_accept_memory(struct zone *zone)
>  
>  	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
>  	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> -	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, -1 << order);
> +	__mod_zone_page_state(zone, NR_UNACCEPTED, -1 << order);
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	if (last)
> @@ -1808,7 +1808,7 @@ static void __free_unaccepted(struct page *page, unsigned int order)
>  	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
>  	list_add_tail(&page->lru, &zone->unaccepted_pages);
>  	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> -	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, 1 << order);
> +	__mod_zone_page_state(zone, NR_UNACCEPTED, 1 << order);
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	if (first)
> @@ -4074,6 +4074,9 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
>  	if (!(alloc_flags & ALLOC_CMA))
>  		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
>  #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	unusable_free += zone_page_state(z, NR_UNACCEPTED);
> +#endif
>  
>  	return unusable_free;
>  }


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 02/14] mm: Add support for unaccepted memory
  2022-12-07  1:49 ` [PATCHv8 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
  2022-12-09 18:10   ` Vlastimil Babka
  2022-12-26 12:23   ` Borislav Petkov
@ 2023-01-16 13:04   ` Mel Gorman
  2 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2023-01-16 13:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On Wed, Dec 07, 2022 at 04:49:21AM +0300, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways the kernel can deal with unaccepted memory:
> 
>  1. Accept all the memory during the boot. It is easy to implement and
>     it doesn't have runtime cost once the system is booted. The downside
>     is very long boot time.
> 
>     Acceptance can be parallelized across multiple CPUs to keep it
>     manageable (e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
>     saturate memory bandwidth and does not scale beyond that point.
> 

This should be selectable via the kernel command line if at all possible,
and as soon as possible, because if issues like fragmentation show up as
highlighted by Vlastimil, there will be a need to do a comparison.
Furthermore, while boot time is the primary focus of this series, it ignores
the ramp-up time of an application that ultimately uses a high percentage of
memory. A latency-sensitive application that stalls while accepting memory
during ramp-up may be more undesirable than a long boot time because the
stalls are unpredictable. While it's not exactly the same, something similar
happens for ext4 with lazy inode init after filesystem creation, and that
feature can be disabled because it screws with benchmarks based on fresh
filesystems. That problem only happens on filesystem init though, not on
every boot.

>  2. Accept a block of memory on the first use. It requires more
>     infrastructure and changes in page allocator to make it work, but
>     it provides good boot time.
> 
>     On-demand memory acceptance means latency spikes every time the kernel
>     steps onto a new memory block. The spikes will go away once the
>     workload's data set size stabilizes or all memory gets accepted.
> 

For very large machines (and > 1TB virtual machines do exist), this may
take a long time.

>  3. Accept all memory in background. Introduce a thread (or multiple)
>     that gets memory accepted proactively. It will minimize the time the
>     system experiences latency spikes on memory allocation while keeping
>     boot time low.
> 
>     This approach cannot function on its own. It is an extension of #2:
>     background memory acceptance requires a functional scheduler, but the
>     page allocator may need to tap into unaccepted memory before that.
> 
>     The downside of the approach is that these threads also steal CPU
>     cycles and memory bandwidth from the user's workload and may hurt
>     user experience.
> 

This is not necessarily true: while the background thread (or threads, one
per zone using one local CPU) is running, an allocation request could block
on a waitqueue until the background thread makes progress. That means the
application would only stall if the background thread is too slow or there
are too many allocation requests. With that setup, it would be possible for
an application to never block on accepting memory.
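
Roughly like this (sketch only; the acceptor_kick/acceptor_done waitqueues
and the kthread behind them are invented for illustration):

	/* Allocation path, once a zone has run out of accepted memory: */
	static void wait_for_background_accept(struct zone *zone)
	{
		/* Nudge the per-zone acceptor kthread... */
		wake_up(&zone->acceptor_kick);

		/* ...and sleep until it has made enough progress. */
		wait_event(zone->acceptor_done,
			   !zone_page_state(zone, NR_UNACCEPTED) ||
			   zone_watermark_ok(zone, 0, low_wmark_pages(zone),
					     0, 0));
	}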


> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3 and they can be implemented later based on user's
> demands.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>   - memblock has to accept memory on allocation;
> 
>   - page allocator has to accept memory on the first allocation of the
>     page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages. New memory gets accepted
> before putting pages on free lists. It is done lazily: only accept new
> pages when we run out of already accepted memory.
> 

And this is the primary disadvantage of #2 compared to #3: stalls are
guaranteed for some allocating tasks until all memory is accepted.

> An architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>  - accept_memory() makes a range of physical addresses accepted.
> 

accept_memory is introduced later in the series and it's an arch-specific
decision on what range to accept. pageblock order would be the minimum to
help fragmentation avoidance, but MAX_ORDER-1 would make more sense.
MAX_ORDER-1 would amortise the cost of acceptance, reduce potential issues
with fragmentation and reduce the total number of stalls required to accept
memory. The minimum range to accept should be generically enforced in
accept_memory, which then calls arch_accept_memory with the range.
accept_page -> accept_memory is fine but the core->arch_specific divide
should be clear.
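
For example (sketch of the split; arch_accept_memory() is just what I'd
call the arch hook, it doesn't exist in this series):

	/* The core enforces the minimum acceptance unit generically... */
	void accept_memory(phys_addr_t start, phys_addr_t end)
	{
		start = round_down(start, MAX_ORDER_NR_PAGES * PAGE_SIZE);
		end = round_up(end, MAX_ORDER_NR_PAGES * PAGE_SIZE);

		/* ...so the arch hook only ever sees aligned ranges. */
		arch_accept_memory(start, end);
	}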

>  - range_contains_unaccepted_memory() checks whether anything within the
>    range of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> ---
>  include/linux/mmzone.h     |   5 ++
>  include/linux/page-flags.h |  24 ++++++++
>  mm/internal.h              |  12 ++++
>  mm/memblock.c              |   9 +++
>  mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 169 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5f74891556f3..da335381e63f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -822,6 +822,11 @@ struct zone {
>  	/* free areas of different sizes */
>  	struct free_area	free_area[MAX_ORDER];
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* pages to be accepted */
> +	struct list_head	unaccepted_pages;
> +#endif
> +

If MAX_ORDER-1 ranges were accepted at a time, it could be documented that
the list_head only contains pages of a single fixed order.

>  	/* zone flags, see below */
>  	unsigned long		flags;
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 0b0ae5084e60..ce953be8fe10 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -941,6 +941,7 @@ static inline bool is_page_hwpoison(struct page *page)
>  #define PG_offline	0x00000100
>  #define PG_table	0x00000200
>  #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
>  
>  #define PageType(page, flag)						\
>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -966,6 +967,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>  	page->page_type |= PG_##lname;					\
>  }
>  
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}
> +
>  /*
>   * PageBuddy() indicates that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
> @@ -996,6 +1009,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>  extern void page_offline_freeze(void);
>  extern void page_offline_thaw(void);
>  extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b7ef495b56d..8ef4f88608ad 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -856,4 +856,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
>  	return !(vma->vm_flags & VM_SOFTDIRTY);
>  }
>  
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 511d4783dcf1..3bc404a5352a 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1423,6 +1423,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  		 */
>  		kmemleak_alloc_phys(found, size, 0);
>  
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>  	return found;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e60657875d3..6d597e833a73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +/* Counts number of zones with unaccepted pages. */
> +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> +#endif
> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  /*
>   * During boot we initialize deferred pages on-demand, as needed, but once
> @@ -1043,12 +1048,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>  {
>  	struct free_area *area = &zone->free_area[order];
>  
> +	VM_BUG_ON_PAGE(PageUnevictable(page), page);
>  	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>  					   unsigned int order)
>  {
> +	VM_BUG_ON_PAGE(PageUnevictable(page), page);
> +
>  	/* clear reported state and update reported page count */
>  	if (page_reported(page))
>  		__ClearPageReported(page);

Aside from what Vlastimil said, it's not commented *why* these VM_BUG_ON_PAGE
(or WARN) could trigger and why Unevictable specifically. It could be a long
time after these patches are merged before they are used in production, so if
it does happen, it'd be nice to have a comment explaining why it could occur.

> @@ -1728,6 +1736,97 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	__count_vm_events(PGFREE, 1 << order);
>  }
>  
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +

If pages are accepted in MAX_ORDER-1 ranges enforced by the core, it won't
be necessary to check full ranges, as one page is enough to know whether the
whole block needs to be accepted.
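
i.e. something like (sketch):

	static bool page_contains_unaccepted(struct page *page, unsigned int order)
	{
		phys_addr_t start = page_to_phys(page);

		/* Fixed acceptance unit: probing one page is enough. */
		return range_contains_unaccepted_memory(start, start + PAGE_SIZE);
	}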

> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +}
> +

Generically enforce a minimum range -- MAX_ORDER-1 recommended.

> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	unsigned long flags, order;
> +	struct page *page;
> +	bool last = false;
> +	int migratetype;
> +
> +	if (!static_branch_unlikely(&unaccepted_pages))
> +		return false;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	page = list_first_entry_or_null(&zone->unaccepted_pages,
> +					struct page, lru);
> +	if (!page) {
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +		return false;
> +	}
> +

Probably could check this first locklessly as unaccepted pages only go down,
never up unless memory hotplug is a factor?
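
i.e. (sketch):

	/* Lockless fast path; the list only ever shrinks under us. */
	if (list_empty(&zone->unaccepted_pages))
		return false;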

> +	list_del(&page->lru);
> +	last = list_empty(&zone->unaccepted_pages);
> +
> +	order = page->private;

If blocks are released in a fixed size such as MAX_ORDER-1, then the order
is irrelevant.

> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> +
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));

As is the migration type, because the migration type will become the type
that splits at least a pageblock for the first time.

> +	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	if (last)
> +		static_branch_dec(&unaccepted_pages);
> +
> +	accept_page(page, order);
> +	__ClearPageUnaccepted(page);
> +	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +

If accept_page returns the first struct page of the freed block, then this
can be simplified. The corner cases (zone boundaries or holes within a zone
that are not MAX_ORDER-1 aligned) should be accepted unconditionally in
early boot.

> +	return true;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long flags;
> +	int migratetype;
> +	bool first = false;
> +
> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);

Regardless of forced alignment, this should be a WARN_ON and bail. It's most
likely early in boot and if it affects 1 machine, it'll happen every time.
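
Something like (sketch):

	/* Warn once and leak the range instead of taking the machine down. */
	if (WARN_ON_ONCE(order > MAX_ORDER || order < pageblock_order))
		return;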

> +	__SetPageUnaccepted(page);
> +	page->private = order;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	first = list_empty(&zone->unaccepted_pages);
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> +	list_add_tail(&page->lru, &zone->unaccepted_pages);
> +	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +

Enforcing alignment would also simplify fragmentation handling here.

> +	if (first)
> +		static_branch_inc(&unaccepted_pages);
> +}
> +
> +#else
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	return false;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	BUILD_BUG();
> +}
> +
> +#endif /* CONFIG_UNACCEPTED_MEMORY */
> +
>  void __free_pages_core(struct page *page, unsigned int order)
>  {
>  	unsigned int nr_pages = 1 << order;
> @@ -1750,6 +1849,13 @@ void __free_pages_core(struct page *page, unsigned int order)
>  
>  	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>  
> +	if (page_contains_unaccepted(page, order)) {
> +		if (order >= pageblock_order)
> +			return __free_unaccepted(page, order);
> +		else
> +			accept_page(page, order);
> +	}
> +

If MAX_ORDER-1 is the minimum for acceptance then this changes. It would
have to check whether this is a zone boundary and, if so, accept the pages
immediately (with a comment); otherwise, add a MAX_ORDER-1 page to the
unaccepted list.
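
Roughly (sketch):

	if (page_contains_unaccepted(page, order)) {
		/* Only full MAX_ORDER-1 blocks go on the lazy list; */
		if (order == MAX_ORDER - 1)
			return __free_unaccepted(page, order);
		/* zone-boundary leftovers get accepted eagerly. */
		accept_page(page, order);
	}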

>  	/*
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
> @@ -1910,6 +2016,9 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>  
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if (pageblock_aligned(pfn))
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -4247,6 +4356,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  				       gfp_mask)) {
>  			int ret;
>  
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;
> +

This is a potential fragmentation hazard because by the time this is
reached, the watermarks are low. By then, pageblocks may already have become
mixed due to __rmqueue_fallback and the mixing of blocks may be persistent.
__rmqueue_fallback should fail if there is still unaccepted memory.

That said, what you have here is fine, it's just not enough to avoid
the hazard on its own.

>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/*
>  			 * Watermark failed for this zone, but see if we can
> @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  
>  			return page;
>  		} else {
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;
> +

This has a similar problem: fallbacks could have already occurred. By
forbidding fallbacks until all memory within a zone is accepted, this should
be less of an issue because NULL will be returned, more memory is accepted
and the allocation retries. It'll be a little race-prone, as another request
could allocate the newly accepted memory before the retry, but it would
eventually be resolved.
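
Forbidding the fallback could be as simple as (sketch, reusing the static
key from this patch):

	/* In __rmqueue_fallback(), before stealing from another migratetype: */
	if (static_branch_unlikely(&unaccepted_pages) &&
	    !list_empty(&zone->unaccepted_pages))
		return false;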

Vlastimil was correct in terms of how to gauge how serious fragmentation
issues are early on. Tracking /proc/pagetypeinfo helps but is hard to use
because it has no information on mixed pageblocks. It's better to track
mm_page_alloc_extfrag and, at the very least, count the number of times it
fires when all memory is accepted up-front versus deferred acceptance. Any
major difference in counts is potentially problematic.

At the very least, a kernel build with all CPUs early in boot would be a
start. Slightly better would be to run a number of separate kernel builds
that use a high percentage of memory. As those allocations are a good mix
but short-lived, it would also help to run fio in the background creating
many 4K files, to generate an overall mix of page table allocations, slab
and movable allocs, some of which are short-lived from kernbench and others
more persistent due to fio. I don't have something suitable automated in
mmtests at the moment as it's been a while since I investigated
fragmentation specifically and the old tests took too long to complete on
larger systems.

>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/* Try again if zone has deferred pages */
>  			if (static_branch_unlikely(&deferred_pages)) {
> @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>  		zone->free_area[order].nr_free = 0;
>  	}
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	INIT_LIST_HEAD(&zone->unaccepted_pages);
> +#endif
>  }
>  
>  /*
> -- 
> 2.38.0
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHv8 06/14] efi/x86: Implement support for unaccepted memory
  2023-01-03 14:20   ` Borislav Petkov
@ 2023-03-25  0:51     ` Kirill A. Shutemov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2023-03-25  0:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Ard Biesheuvel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Dave Hansen, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Tue, Jan 03, 2023 at 03:20:55PM +0100, Borislav Petkov wrote:
> > diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> > index 6787ed8dfacf..8aa8adf0bcb5 100644
> > --- a/drivers/firmware/efi/Kconfig
> > +++ b/drivers/firmware/efi/Kconfig
> > @@ -314,6 +314,20 @@ config EFI_COCO_SECRET
> >  	  virt/coco/efi_secret module to access the secrets, which in turn
> >  	  allows userspace programs to access the injected secrets.
> >  
> > +config UNACCEPTED_MEMORY
> > +	bool
> > +	depends on EFI_STUB
> 
> This still doesn't make a whole lotta sense. If I do "make menuconfig" I don't
> see the help text because that bool doesn't have a string prompt. So who is that
> help text for?

It is a form of documentation for a developer. The same happens for other
options, for instance BOOT_VESA_SUPPORT or ARCH_HAS_CURRENT_STACK_POINTER.

Yes, it is not visible to the user, but I still think it is helpful for a
developer to understand what the option does.

> Then, in the last patch you have
> 
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -888,6 +888,8 @@ config INTEL_TDX_GUEST
>         select ARCH_HAS_CC_PLATFORM
>         select X86_MEM_ENCRYPT
>         select X86_MCE
> +       select UNACCEPTED_MEMORY
> +       select EFI_STUB
> 
> I guess you want to select UNACCEPTED_MEMORY only.

I had to rework it as

config INTEL_TDX_GUEST
	...
	depends on EFI_STUB
	select UNACCEPTED_MEMORY

A naked select UNACCEPTED_MEMORY doesn't work if EFI and EFI_STUB are
disabled:

WARNING: unmet direct dependencies detected for UNACCEPTED_MEMORY
  Depends on [n]: EFI [=n] && EFI_STUB [=n]
  Selected by [y]:
  - INTEL_TDX_GUEST [=y] && HYPERVISOR_GUEST [=y] && X86_64 [=y] && CPU_SUP_INTEL [=y] && X86_X2APIC [=y]

IIUC, the alternative is to have selects all the way down the option tree.

> 
> And I've already mentioned this whole mess:
> 
> https://lore.kernel.org/r/Yt%2BnOeLMqRxjObbx@zn.tnic
> 
> Please incorporate all review comments before sending a new version of
> your patch.
> 
> Ignoring review feedback is a very unfriendly thing to do:
> 
> - if you agree with the feedback, you work it in in the next revision
> 
> - if you don't agree, you *say* *why* you don't

Sorry, it was not my intention. I misread your comment and focused on
build issues around the option.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-03-25  0:52 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-07  1:49 [PATCHv8 00/14] mm, x86/cc: Implement support for unaccepted memory Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 01/14] x86/boot: Centralize __pa()/__va() definitions Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 02/14] mm: Add support for unaccepted memory Kirill A. Shutemov
2022-12-09 18:10   ` Vlastimil Babka
2022-12-09 19:26     ` Kirill A. Shutemov
2022-12-09 22:23       ` Vlastimil Babka
2022-12-24 16:46         ` Kirill A. Shutemov
2023-01-12 11:59           ` Vlastimil Babka
2022-12-26 12:23   ` Borislav Petkov
2022-12-27  3:18     ` Kirill A. Shutemov
2023-01-16 13:04   ` Mel Gorman
2022-12-07  1:49 ` [PATCHv8 03/14] mm: Report unaccepted memory in meminfo Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 04/14] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 05/14] x86/boot: Add infrastructure required for unaccepted memory support Kirill A. Shutemov
2023-01-03 13:52   ` Borislav Petkov
2022-12-07  1:49 ` [PATCHv8 06/14] efi/x86: Implement support for unaccepted memory Kirill A. Shutemov
2023-01-03 14:20   ` Borislav Petkov
2023-03-25  0:51     ` Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 07/14] x86/boot/compressed: Handle " Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 08/14] x86/mm: Reserve unaccepted memory bitmap Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 09/14] x86/mm: Provide helpers for unaccepted memory Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into " Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 11/14] x86: Disable kexec if system has " Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 12/14] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 13/14] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
2022-12-07  1:49 ` [PATCHv8 14/14] x86/tdx: Add unaccepted memory support Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).