* [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory
@ 2023-05-13 22:04 Kirill A. Shutemov
  2023-05-13 22:04 ` [PATCHv11 1/9] mm: Add " Kirill A. Shutemov
                   ` (9 more replies)
  0 siblings, 10 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed. This lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. It leads
to table fragmentation, and there's a limited number of entries in the
e820 table.

Another option is to mark such memory as usable in e820 and track whether
the range has been accepted in a bitmap. One bit in the bitmap represents
2MiB in the address space: one 4k page is enough to track 64GiB of
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
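
The bitmap sizes quoted above follow from simple arithmetic. A minimal
sketch, assuming a fixed 2MiB unit per bit; UNIT_SIZE and bitmap_bytes()
are illustrative names and not part of the patchset:

```c
#include <stdint.h>

/* Assumption for illustration: one bit tracks one 2MiB unit. */
#define UNIT_SIZE	(2ULL << 20)

/* Bytes of bitmap needed to cover `span` bytes of address space. */
static uint64_t bitmap_bytes(uint64_t span)
{
	/* Round up to whole units, then to whole bytes. */
	uint64_t bits = (span + UNIT_SIZE - 1) / UNIT_SIZE;

	return (bits + 7) / 8;
}
```

With these assumptions, 64GiB of address space needs 32768 bits, i.e. one
4k page, and 4PiB needs 2^31 bits, i.e. 256MiB.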

Any unaccepted memory that is not aligned to 2M gets accepted upfront.
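
One way to picture the alignment rule: round the unaccepted range inward
to 2MiB boundaries, track only the aligned middle in the bitmap, and accept
the unaligned head and tail right away. The sketch below is a simplified
illustration under that assumption; struct split and split_range() are
hypothetical names, not code from the series:

```c
#include <stdint.h>

#define UNIT_SIZE	(2ULL << 20)	/* assumed 2MiB tracking unit */

struct split {
	uint64_t bitmap_start;	/* aligned part that the bitmap would track */
	uint64_t bitmap_end;
};

/*
 * Round [start, end) inward to unit boundaries. Anything outside
 * [bitmap_start, bitmap_end) would be accepted upfront.
 */
static struct split split_range(uint64_t start, uint64_t end)
{
	struct split s;

	s.bitmap_start = (start + UNIT_SIZE - 1) & ~(UNIT_SIZE - 1);
	s.bitmap_end = end & ~(UNIT_SIZE - 1);

	/* No whole unit inside: the entire range is accepted upfront. */
	if (s.bitmap_start >= s.bitmap_end)
		s.bitmap_start = s.bitmap_end = end;

	return s;
}
```

For a range from 1MiB to 7MiB this leaves 2MiB..6MiB for the bitmap, with
the first and the last megabyte accepted immediately.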

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for 4G TDX VM and ~4x faster for 64G.

TDX-specific code is isolated from the core of unaccepted memory support.
This is supposed to help plug in different implementations of unaccepted
memory, such as SEV-SNP.

-- Fragmentation study --

Vlastimil and Mel were concerned about the effect of unaccepted memory on
fragmentation prevention measures in the page allocator. I tried to evaluate
it, but it is tricky. As suggested, I tried to run multiple parallel kernel
builds and follow how often kmem:mm_page_alloc_extfrag gets hit.

See the results in v9 of the patchset [1][2].

[1] https://lore.kernel.org/all/20230330114956.20342-1-kirill.shutemov@linux.intel.com
[2] https://lore.kernel.org/all/20230416191940.ex7ao43pmrjhru2p@box.shutemov.name

--

The tree can be found here:

https://github.com/intel/tdx.git guest-unaccepted-memory

The patchset depends on MAX_ORDER changes in MM tree.

v11:
 - Restructure the code to make it less x86-specific (suggested by Ard):
   + use EFI configuration table instead of zero-page to pass down bitmap;
   + do not imply 1bit == 2M in bitmap;
   + move bulk of the code under drivers/firmware/efi;
 - The bitmap only covers unaccepted memory now. All memory that is not covered
   by the bitmap is assumed accepted;
 - Reviewed-by from Ard;
v10:
 - Restructure code around zones_with_unaccepted_pages static branch to avoid
   unnecessary function calls (Suggested by Vlastimil);
 - Drop mentions of PageUnaccepted();
 - Drop patches that add fake unaccepted memory support and sysfs handle to
   accept memory manually;
 - Add Reviewed-by from Vlastimil;
v9:
 - Accept memory up to high watermark when kernel runs out of free memory;
 - Treat unaccepted memory as unusable in __zone_watermark_unusable_free();
 - Per-zone unaccepted memory accounting;
 - All pages on unaccepted list are MAX_ORDER now;
 - accept_memory=eager in cmdline to pre-accept memory during the boot;
 - Implement fake unaccepted memory;
 - Sysfs handle to accept memory manually;
 - Drop PageUnaccepted();
 - Rename unaccepted_pages static key to zones_with_unaccepted_pages;
v8:
 - Rewrite core-mm support for unaccepted memory (patch 02/14);
 - s/UnacceptedPages/Unaccepted/ in meminfo;
 - Drop arch/x86/boot/compressed/compiler.h;
 - Fix build errors;
 - Adjust commit messages and comments;
 - Reviewed-bys from Dave and Borislav;
 - Rebased to tip/master.
v7:
 - Rework meminfo counter to use PageUnaccepted() and move to generic code;
 - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
 - Add Reviewed-by from David;
v6:
 - Fix load_unaligned_zeropad() on machine with unaccepted memory;
 - Clear PageUnaccepted() on merged pages, leaving it only on head;
 - Clarify error handling in allocate_e820();
 - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
 - Disable kexec at boottime instead of build conflict;
 - Rebased to tip/master;
 - Spelling fixes;
 - Add Reviewed-by from Mike and David;
v5:
 - Updates comments and commit messages;
   + Explain options for unaccepted memory handling;
 - Expose amount of unaccepted memory in /proc/meminfo;
 - Adjust check in page_expected_state();
 - Fix error code handling in allocate_e820();
 - Centralize __pa()/__va() definitions in the boot stub;
 - Avoid includes from the main kernel in the boot stub;
 - Use an existing hole in boot_param for unaccepted_memory, instead of adding
   to the end of the structure;
 - Extract allocate_unaccepted_memory() from allocate_e820();
 - Complain if there's unaccepted memory, but kernel does not support it;
 - Fix vmstat counter;
 - Split up few preparatory patches;
 - Random readability adjustments;
v4:
 - PageBuddyUnaccepted() -> PageUnaccepted;
 - Use separate page_type, not shared with offline;
 - Rework interface between core-mm and arch code;
 - Adjust commit messages;
 - Ack from Mike;

Kirill A. Shutemov (9):
  mm: Add support for unaccepted memory
  efi/x86: Get full memory map in allocate_e820()
  efi/libstub: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  efi: Provide helpers for unaccepted memory
  efi/unaccepted: Avoid load_unaligned_zeropad() stepping into
    unaccepted memory
  x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
    boot stub
  x86/tdx: Refactor try_accept_one()
  x86/tdx: Add unaccepted memory support

 arch/x86/Kconfig                              |   2 +
 arch/x86/boot/compressed/Makefile             |   1 +
 arch/x86/boot/compressed/efi.h                |   1 +
 arch/x86/boot/compressed/error.c              |  19 ++
 arch/x86/boot/compressed/error.h              |   1 +
 arch/x86/boot/compressed/kaslr.c              |  35 ++-
 arch/x86/boot/compressed/mem.c                |  42 ++++
 arch/x86/boot/compressed/misc.c               |   6 +
 arch/x86/boot/compressed/misc.h               |   6 +
 arch/x86/boot/compressed/tdx-shared.c         |   2 +
 arch/x86/boot/compressed/tdx.c                |  37 +++
 arch/x86/coco/tdx/Makefile                    |   2 +-
 arch/x86/coco/tdx/tdx-shared.c                |  95 ++++++++
 arch/x86/coco/tdx/tdx.c                       | 118 +---------
 arch/x86/include/asm/efi.h                    |   2 +
 arch/x86/include/asm/shared/tdx.h             |  53 +++++
 arch/x86/include/asm/tdx.h                    |  21 +-
 arch/x86/include/asm/unaccepted_memory.h      |  23 ++
 drivers/base/node.c                           |   7 +
 drivers/firmware/efi/Kconfig                  |  14 ++
 drivers/firmware/efi/Makefile                 |   1 +
 drivers/firmware/efi/efi.c                    |   7 +
 drivers/firmware/efi/libstub/Makefile         |   2 +
 drivers/firmware/efi/libstub/bitmap.c         |  41 ++++
 drivers/firmware/efi/libstub/efistub.h        |   6 +
 drivers/firmware/efi/libstub/find.c           |  43 ++++
 .../firmware/efi/libstub/unaccepted_memory.c  | 222 ++++++++++++++++++
 drivers/firmware/efi/libstub/x86-stub.c       |  39 +--
 drivers/firmware/efi/unaccepted_memory.c      | 138 +++++++++++
 fs/proc/meminfo.c                             |   5 +
 include/linux/efi.h                           |  13 +-
 include/linux/mm.h                            |  19 ++
 include/linux/mmzone.h                        |   8 +
 mm/internal.h                                 |   1 +
 mm/memblock.c                                 |   9 +
 mm/mm_init.c                                  |   7 +
 mm/page_alloc.c                               | 173 ++++++++++++++
 mm/vmstat.c                                   |   3 +
 38 files changed, 1060 insertions(+), 164 deletions(-)
 create mode 100644 arch/x86/boot/compressed/mem.c
 create mode 100644 arch/x86/boot/compressed/tdx-shared.c
 create mode 100644 arch/x86/coco/tdx/tdx-shared.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 drivers/firmware/efi/libstub/bitmap.c
 create mode 100644 drivers/firmware/efi/libstub/find.c
 create mode 100644 drivers/firmware/efi/libstub/unaccepted_memory.c
 create mode 100644 drivers/firmware/efi/unaccepted_memory.c

-- 
2.39.3



* [PATCHv11 1/9] mm: Add support for unaccepted memory
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-16 19:44   ` Tom Lendacky
  2023-05-13 22:04 ` [PATCHv11 2/9] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Mike Rapoport

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during the boot. It is easy to implement and
    it doesn't have runtime cost once the system is booted. The downside
    is very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond a certain point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time the kernel
    steps onto a new memory block. The spikes will go away once the
    workload's data set size stabilizes or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

The patch implements #1 and #2 for now. #2 is the default. Some
workloads may want to use #1 with accept_memory=eager in kernel
command line. #3 can be implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

Memblock change is trivial.

The page allocator is modified to accept pages. New memory gets accepted
before putting pages on free lists. It is done lazily: only accept new
pages when we run out of already accepted memory. The memory gets
accepted until the high watermark is reached.
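
The "accept until the high watermark" behaviour can be modelled in plain
userspace C. This is a toy model, not the kernel code: struct zone_model,
CHUNK_PAGES and both functions are made up for illustration, and the
watermark arithmetic is simplified to a single free-page counter.

```c
#include <stdbool.h>

#define CHUNK_PAGES	1024L	/* assumed pages per unaccepted chunk */

/* Toy stand-in for the per-zone state the allocator would consult. */
struct zone_model {
	long free_pages;	/* accepted and free */
	long unaccepted_chunks;	/* chunks still waiting for acceptance */
	long high_wmark;	/* target number of free pages */
};

/* Accept a single chunk, if any is left, and make it available. */
static bool accept_one(struct zone_model *z)
{
	if (z->unaccepted_chunks == 0)
		return false;

	z->unaccepted_chunks--;
	z->free_pages += CHUNK_PAGES;
	return true;
}

/* Accept at least one chunk, then keep going until the watermark is met. */
static bool accept_to_high_wmark(struct zone_model *z)
{
	long to_accept = z->high_wmark - z->free_pages;
	bool ret = false;

	do {
		if (!accept_one(z))
			break;
		ret = true;
		to_accept -= CHUNK_PAGES;
	} while (to_accept > 0);

	return ret;
}
```

The do/while shape means a caller that ran out of memory always gets at
least one chunk when any is left, even if the zone already sits above the
high watermark.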

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.
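
A rough userspace sketch of how two such helpers could sit on top of a
bitmap. Everything here is an assumption for illustration: the toy_ names,
the single 64-bit bitmap, the 2MiB unit and the empty platform_accept()
hook are not the interfaces added by this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNIT_SIZE	(2ULL << 20)	/* assumed tracking unit */
#define NR_UNITS	64		/* toy bitmap covers 128MiB */

static uint64_t unaccepted_bitmap;	/* bit set == unit not yet accepted */

/* Hypothetical stand-in for the platform-specific acceptance call. */
static void platform_accept(uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
}

/* True if any unit overlapping [start, end) is still unaccepted. */
static bool toy_range_contains_unaccepted(uint64_t start, uint64_t end)
{
	for (uint64_t u = start / UNIT_SIZE;
	     u * UNIT_SIZE < end && u < NR_UNITS; u++)
		if (unaccepted_bitmap & (1ULL << u))
			return true;

	return false;
}

/* Accept every still-unaccepted unit overlapping [start, end). */
static void toy_accept_memory(uint64_t start, uint64_t end)
{
	for (uint64_t u = start / UNIT_SIZE;
	     u * UNIT_SIZE < end && u < NR_UNITS; u++) {
		if (!(unaccepted_bitmap & (1ULL << u)))
			continue;
		platform_accept(u * UNIT_SIZE, (u + 1) * UNIT_SIZE);
		unaccepted_bitmap &= ~(1ULL << u);
	}
}
```

In this sketch a partially covered unit is accepted as a whole, so calling
toy_accept_memory() twice on the same range is harmless: the second call
finds the bits already cleared.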

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/base/node.c    |   7 ++
 fs/proc/meminfo.c      |   5 ++
 include/linux/mm.h     |  19 +++++
 include/linux/mmzone.h |   8 ++
 mm/internal.h          |   1 +
 mm/memblock.c          |   9 +++
 mm/mm_init.c           |   7 ++
 mm/page_alloc.c        | 173 +++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c            |   3 +
 9 files changed, 232 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index b46db17124f3..655975946ef6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -448,6 +448,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d ShmemPmdMapped: %8lu kB\n"
 			     "Node %d FileHugePages: %8lu kB\n"
 			     "Node %d FilePmdMapped: %8lu kB\n"
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     "Node %d Unaccepted:     %8lu kB\n"
 #endif
 			     ,
 			     nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -477,6 +480,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
 			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+			     ,
+			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
 #endif
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b43d0bd42762..8dca4d6d96c7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -168,6 +168,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	show_val_kb(m, "Unaccepted:     ",
+		    global_zone_page_state(NR_UNACCEPTED));
+#endif
+
 	hugetlb_report_meminfo(m);
 
 	arch_report_meminfo(m);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..d9174d464348 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3816,4 +3816,23 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+						    phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+
+#endif
+
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4889c9d4055..6c1c2fc13017 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -143,6 +143,9 @@ enum zone_stat_item {
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
 	NR_FREE_CMA_PAGES,
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	NR_UNACCEPTED,
+#endif
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
@@ -910,6 +913,11 @@ struct zone {
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER + 1];
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Pages to be accepted. All pages on the list are MAX_ORDER */
+	struct list_head	unaccepted_pages;
+#endif
+
 	/* zone flags, see below */
 	unsigned long		flags;
 
diff --git a/mm/internal.h b/mm/internal.h
index 68410c6d97ac..b1db7ba5f57d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1099,4 +1099,5 @@ struct vma_prepare {
 	struct vm_area_struct *remove;
 	struct vm_area_struct *remove2;
 };
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 3feafea06ab2..50b921119600 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1436,6 +1436,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..1cfc08e25f93 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1375,6 +1375,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
 	}
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	INIT_LIST_HEAD(&zone->unaccepted_pages);
+#endif
 }
 
 void __meminit init_currently_empty_zone(struct zone *zone,
@@ -1960,6 +1964,9 @@ static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	/* Accept chunks smaller than MAX_ORDER upfront */
+	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if (pageblock_aligned(pfn))
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..d239fba3f31c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -387,6 +387,12 @@ EXPORT_SYMBOL(nr_node_ids);
 EXPORT_SYMBOL(nr_online_nodes);
 #endif
 
+static bool page_contains_unaccepted(struct page *page, unsigned int order);
+static void accept_page(struct page *page, unsigned int order);
+static bool try_to_accept_memory(struct zone *zone, unsigned int order);
+static inline bool has_unaccepted_memory(void);
+static bool __free_unaccepted(struct page *page);
+
 int page_group_by_mobility_disabled __read_mostly;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1481,6 +1487,13 @@ void __free_pages_core(struct page *page, unsigned int order)
 
 	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
 
+	if (page_contains_unaccepted(page, order)) {
+		if (order == MAX_ORDER && __free_unaccepted(page))
+			return;
+
+		accept_page(page, order);
+	}
+
 	/*
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
@@ -3159,6 +3172,9 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	if (!(alloc_flags & ALLOC_CMA))
 		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	unusable_free += zone_page_state(z, NR_UNACCEPTED);
+#endif
 
 	return unusable_free;
 }
@@ -3458,6 +3474,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				       gfp_mask)) {
 			int ret;
 
+			if (has_unaccepted_memory()) {
+				if (try_to_accept_memory(zone, order))
+					goto try_this_zone;
+			}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/*
 			 * Watermark failed for this zone, but see if we can
@@ -3510,6 +3531,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 			return page;
 		} else {
+			if (has_unaccepted_memory()) {
+				if (try_to_accept_memory(zone, order))
+					goto try_this_zone;
+			}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/* Try again if zone has deferred pages */
 			if (deferred_pages_enabled()) {
@@ -7215,3 +7241,150 @@ bool has_managed_dma(void)
 	return false;
 }
 #endif /* CONFIG_ZONE_DMA */
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+/* Counts number of zones with unaccepted pages. */
+static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);
+
+static bool lazy_accept = true;
+
+static int __init accept_memory_parse(char *p)
+{
+	if (!strcmp(p, "lazy")) {
+		lazy_accept = true;
+		return 0;
+	} else if (!strcmp(p, "eager")) {
+		lazy_accept = false;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+}
+early_param("accept_memory", accept_memory_parse);
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	phys_addr_t end = start + (PAGE_SIZE << order);
+
+	return range_contains_unaccepted_memory(start, end);
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+}
+
+static bool try_to_accept_memory_one(struct zone *zone)
+{
+	unsigned long flags;
+	struct page *page;
+	bool last;
+
+	if (list_empty(&zone->unaccepted_pages))
+		return false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	page = list_first_entry_or_null(&zone->unaccepted_pages,
+					struct page, lru);
+	if (!page) {
+		spin_unlock_irqrestore(&zone->lock, flags);
+		return false;
+	}
+
+	list_del(&page->lru);
+	last = list_empty(&zone->unaccepted_pages);
+
+	__mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	accept_page(page, MAX_ORDER);
+
+	__free_pages_ok(page, MAX_ORDER, FPI_TO_TAIL);
+
+	if (last)
+		static_branch_dec(&zones_with_unaccepted_pages);
+
+	return true;
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+	long to_accept;
+	int ret = false;
+
+	/* How much to accept to get to high watermark? */
+	to_accept = high_wmark_pages(zone) -
+		    (zone_page_state(zone, NR_FREE_PAGES) -
+		    __zone_watermark_unusable_free(zone, order, 0));
+
+	/* Accept at least one page */
+	do {
+		if (!try_to_accept_memory_one(zone))
+			break;
+		ret = true;
+		to_accept -= MAX_ORDER_NR_PAGES;
+	} while (to_accept > 0);
+
+	return ret;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+	return static_branch_unlikely(&zones_with_unaccepted_pages);
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	bool first = false;
+
+	if (!lazy_accept)
+		return false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	first = list_empty(&zone->unaccepted_pages);
+	list_add_tail(&page->lru, &zone->unaccepted_pages);
+	__mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	if (first)
+		static_branch_inc(&zones_with_unaccepted_pages);
+
+	return true;
+}
+
+#else
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	return false;
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+	return false;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+	return false;
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+	BUILD_BUG();
+	return false;
+}
+
+#endif /* CONFIG_UNACCEPTED_MEMORY */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c28046371b45..282349cabf01 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1180,6 +1180,9 @@ const char * const vmstat_text[] = {
 	"nr_zspages",
 #endif
 	"nr_free_cma",
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	"nr_unaccepted",
+#endif
 
 	/* enum numa_stat_item counters */
 #ifdef CONFIG_NUMA
-- 
2.39.3



* [PATCHv11 2/9] efi/x86: Get full memory map in allocate_e820()
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
  2023-05-13 22:04 ` [PATCHv11 1/9] mm: Add " Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-16 19:52   ` Tom Lendacky
  2023-05-13 22:04 ` [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Borislav Petkov

Currently allocate_e820() is only interested in the size of the map and
the size of the memory descriptor to determine how many e820 entries the
kernel needs.

UEFI Specification version 2.9 introduces a new memory type --
unaccepted memory. To track unaccepted memory, the kernel needs to
allocate a bitmap. The size of the bitmap is dependent on the maximum
physical address present in the system. A full memory map is required to
find the maximum address.
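
Why the sizes alone are not enough can be seen from a small sketch: the
maximum address only falls out of walking every descriptor. The descriptor
layout below is a simplified, hypothetical stand-in (the real EFI memory
descriptor carries more fields), and max_phys_addr() is an illustrative
name:

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE	4096ULL

/* Simplified, hypothetical descriptor: start address and length in pages. */
struct toy_mem_desc {
	uint64_t phys_addr;
	uint64_t num_pages;
};

/* Walk the whole map and return the highest end address seen. */
static uint64_t max_phys_addr(const struct toy_mem_desc *map, size_t nr)
{
	uint64_t max = 0;

	for (size_t i = 0; i < nr; i++) {
		uint64_t end = map[i].phys_addr +
			       map[i].num_pages * TOY_PAGE_SIZE;

		if (end > max)
			max = end;
	}

	return max;
}
```

Note that the entries need not be sorted, so the last descriptor is not
necessarily the one with the highest address.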

Modify allocate_e820() to get a full memory map.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
---
 drivers/firmware/efi/libstub/x86-stub.c | 26 +++++++++++--------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba..fff81843169c 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -681,28 +681,24 @@ static efi_status_t allocate_e820(struct boot_params *params,
 				  struct setup_data **e820ext,
 				  u32 *e820ext_size)
 {
-	unsigned long map_size, desc_size, map_key;
+	struct efi_boot_memmap *map;
 	efi_status_t status;
-	__u32 nr_desc, desc_version;
+	__u32 nr_desc;
 
-	/* Only need the size of the mem map and size of each mem descriptor */
-	map_size = 0;
-	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
-			     &desc_size, &desc_version);
-	if (status != EFI_BUFFER_TOO_SMALL)
-		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
+	status = efi_get_memory_map(&map, false);
+	if (status != EFI_SUCCESS)
+		return status;
 
-	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
-
-	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
-		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+	nr_desc = map->map_size / map->desc_size;
+	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+			EFI_MMAP_NR_SLACK_SLOTS;
 
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
-		if (status != EFI_SUCCESS)
-			return status;
 	}
 
-	return EFI_SUCCESS;
+	efi_bs_call(free_pool, map);
+	return status;
 }
 
 struct exit_boot_struct {
-- 
2.39.3



* [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
  2023-05-13 22:04 ` [PATCHv11 1/9] mm: Add " Kirill A. Shutemov
  2023-05-13 22:04 ` [PATCHv11 2/9] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-14  5:08   ` Mika Penttilä
  2023-05-16 18:06   ` Ard Biesheuvel
  2023-05-13 22:04 ` [PATCHv11 4/9] x86/boot/compressed: Handle " Kirill A. Shutemov
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until memory is needed. It lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 (or whatever the arch uses) has to be modified on every
page acceptance. It leads to table fragmentation and there's a limited
number of entries in the e820 table.

Another option is to mark such memory as usable in e820 and track whether
the range has been accepted in a bitmap. One bit in the bitmap represents a
naturally aligned power-of-2-sized region of address space, called a unit.

For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB of
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to unit_size gets accepted
upfront.

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via an EFI configuration table. allocate_e820() allocates
the bitmap if unaccepted memory is present, sized according to the
unaccepted region.
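
Since the bitmap covers only the unaccepted region and the unit size is not
hard-coded, translating an address into a bit index needs a base address
and a unit size. A minimal sketch, assuming a table that records both; the
struct layout, field names and toy_mark_unaccepted() are hypothetical and
limited to 64 units for brevity:

```c
#include <stdint.h>

/* Hypothetical table describing the bitmap handed to the kernel. */
struct toy_unaccepted_table {
	uint64_t phys_base;	/* address represented by bit 0 */
	uint64_t unit_size;	/* bytes per bit, a power of two */
	uint64_t bitmap;	/* toy: at most 64 units */
};

/*
 * Mark [start, end) as unaccepted. Both addresses are assumed to be
 * unit-aligned and not below phys_base.
 */
static void toy_mark_unaccepted(struct toy_unaccepted_table *t,
				uint64_t start, uint64_t end)
{
	uint64_t first = (start - t->phys_base) / t->unit_size;
	uint64_t last = (end - t->phys_base) / t->unit_size;

	for (uint64_t bit = first; bit < last && bit < 64; bit++)
		t->bitmap |= 1ULL << bit;
}
```

With phys_base at 1GiB and a 2MiB unit, marking 1GiB+4MiB up to 1GiB+10MiB
sets bits 2, 3 and 4.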

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile             |   1 +
 arch/x86/boot/compressed/mem.c                |   9 +
 arch/x86/include/asm/efi.h                    |   2 +
 drivers/firmware/efi/Kconfig                  |  14 ++
 drivers/firmware/efi/efi.c                    |   1 +
 drivers/firmware/efi/libstub/Makefile         |   2 +
 drivers/firmware/efi/libstub/bitmap.c         |  41 ++++
 drivers/firmware/efi/libstub/efistub.h        |   6 +
 drivers/firmware/efi/libstub/find.c           |  43 ++++
 .../firmware/efi/libstub/unaccepted_memory.c  | 222 ++++++++++++++++++
 drivers/firmware/efi/libstub/x86-stub.c       |  13 +
 include/linux/efi.h                           |  12 +-
 12 files changed, 365 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/boot/compressed/mem.c
 create mode 100644 drivers/firmware/efi/libstub/bitmap.c
 create mode 100644 drivers/firmware/efi/libstub/find.c
 create mode 100644 drivers/firmware/efi/libstub/unaccepted_memory.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdb..cc4978123c30 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,6 +107,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
 
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
new file mode 100644
index 000000000000..67594fcb11d9
--- /dev/null
+++ b/arch/x86/boot/compressed/mem.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "error.h"
+
+void arch_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-acceptance call goes here */
+	error("Cannot accept memory");
+}
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 419280d263d2..8b4be7cecdb8 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -31,6 +31,8 @@ extern unsigned long efi_mixed_mode_stack_pa;
 
 #define ARCH_EFI_IRQ_FLAGS_MASK	X86_EFLAGS_IF
 
+#define EFI_UNACCEPTED_UNIT_SIZE PMD_SIZE
+
 /*
  * The EFI services are called through variadic functions in many cases. These
  * functions are implemented in assembler and support only a fixed number of
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 043ca31c114e..231f1c70d1db 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -269,6 +269,20 @@ config EFI_COCO_SECRET
 	  virt/coco/efi_secret module to access the secrets, which in turn
 	  allows userspace programs to access the injected secrets.
 
+config UNACCEPTED_MEMORY
+	bool
+	depends on EFI_STUB
+	help
+	   Some Virtual Machine platforms, such as Intel TDX, require
+	   some memory to be "accepted" by the guest before it can be used.
+	   This mechanism helps prevent malicious hosts from making changes
+	   to guest memory.
+
+	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
+
+	   This option adds support for unaccepted memory and makes such memory
+	   usable by the kernel.
+
 config EFI_EMBEDDED_FIRMWARE
 	bool
 	select CRYPTO_LIB_SHA256
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index abeff7dc0b58..7dce06e419c5 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -843,6 +843,7 @@ static __initdata char memory_type_name[][13] = {
 	"MMIO Port",
 	"PAL Code",
 	"Persistent",
+	"Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
index 3abb2b357482..16d64a34d1e1 100644
--- a/drivers/firmware/efi/libstub/Makefile
+++ b/drivers/firmware/efi/libstub/Makefile
@@ -96,6 +96,8 @@ CFLAGS_arm32-stub.o		:= -DTEXT_OFFSET=$(TEXT_OFFSET)
 zboot-obj-$(CONFIG_RISCV)	:= lib-clz_ctz.o lib-ashldi3.o
 lib-$(CONFIG_EFI_ZBOOT)		+= zboot.o $(zboot-obj-y)
 
+lib-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o bitmap.o find.o
+
 extra-y				:= $(lib-y)
 lib-y				:= $(patsubst %.o,%.stub.o,$(lib-y))
 
diff --git a/drivers/firmware/efi/libstub/bitmap.c b/drivers/firmware/efi/libstub/bitmap.c
new file mode 100644
index 000000000000..5c9bba0d549b
--- /dev/null
+++ b/drivers/firmware/efi/libstub/bitmap.c
@@ -0,0 +1,41 @@
+#include <linux/bitmap.h>
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_set >= 0) {
+		*p |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_LONG;
+		mask_to_set = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+		*p |= mask_to_set;
+	}
+}
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 67d5a20802e0..8659a01664b8 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -1133,4 +1133,10 @@ const u8 *__efi_get_smbios_string(const struct efi_smbios_record *record,
 void efi_remap_image(unsigned long image_base, unsigned alloc_size,
 		     unsigned long code_size);
 
+efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
+					struct efi_boot_memmap *map);
+void process_unaccepted_memory(u64 start, u64 end);
+void accept_memory(phys_addr_t start, phys_addr_t end);
+void arch_accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif
diff --git a/drivers/firmware/efi/libstub/find.c b/drivers/firmware/efi/libstub/find.c
new file mode 100644
index 000000000000..4e7740d28987
--- /dev/null
+++ b/drivers/firmware/efi/libstub/find.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/bitmap.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+
+/*
+ * Common helper for find_next_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
+ * @start: The bitnumber to start searching at
+ */
+#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)				\
+({										\
+	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
+										\
+	if (unlikely(__start >= sz))						\
+		goto out;							\
+										\
+	mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));				\
+	idx = __start / BITS_PER_LONG;						\
+										\
+	for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {			\
+		if ((idx + 1) * BITS_PER_LONG >= sz)				\
+			goto out;						\
+		idx++;								\
+	}									\
+										\
+	sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);			\
+out:										\
+	sz;									\
+})
+
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+}
+
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+					 unsigned long start)
+{
+	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+}
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
new file mode 100644
index 000000000000..f4642c4f25dd
--- /dev/null
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <asm/efi.h>
+#include "efistub.h"
+
+static struct efi_unaccepted_memory *unaccepted_table;
+
+efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
+					struct efi_boot_memmap *map)
+{
+	efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
+	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+	efi_status_t status;
+	int i;
+
+	/* Check if the table is already installed */
+	unaccepted_table = get_efi_config_table(unaccepted_table_guid);
+	if (unaccepted_table) {
+		if (unaccepted_table->version != 1) {
+			efi_err("Unknown version of unaccepted memory table\n");
+			return EFI_UNSUPPORTED;
+		}
+		return EFI_SUCCESS;
+	}
+
+	/* Check if there's any unaccepted memory and find the max address */
+	for (i = 0; i < nr_desc; i++) {
+		efi_memory_desc_t *d;
+		unsigned long m = (unsigned long)map->map;
+
+		d = efi_early_memdesc_ptr(m, map->desc_size, i);
+		if (d->type != EFI_UNACCEPTED_MEMORY)
+			continue;
+
+		unaccepted_start = min(unaccepted_start, d->phys_addr);
+		unaccepted_end = max(unaccepted_end,
+				     d->phys_addr + d->num_pages * PAGE_SIZE);
+	}
+
+	if (unaccepted_start == ULLONG_MAX)
+		return EFI_SUCCESS;
+
+	unaccepted_start = round_down(unaccepted_start,
+				      EFI_UNACCEPTED_UNIT_SIZE);
+	unaccepted_end = round_up(unaccepted_end, EFI_UNACCEPTED_UNIT_SIZE);
+
+	/*
+	 * If unaccepted memory is present, allocate a bitmap to track what
+	 * memory has to be accepted before access.
+	 *
+	 * One bit in the bitmap represents 2MiB in the address space:
+	 * A 4k bitmap can track 64GiB of physical address space.
+	 *
+	 * In the worst case scenario -- a huge hole in the middle of the
+	 * address space -- It needs 256MiB to handle 4PiB of the address
+	 * space.
+	 *
+	 * The bitmap will be populated in setup_e820() according to the memory
+	 * map after efi_exit_boot_services().
+	 */
+	bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
+				   EFI_UNACCEPTED_UNIT_SIZE * BITS_PER_BYTE);
+
+	status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
+			     sizeof(*unaccepted_table) + bitmap_size,
+			     (void **)&unaccepted_table);
+	if (status != EFI_SUCCESS) {
+		efi_err("Failed to allocate unaccepted memory config table\n");
+		return status;
+	}
+
+	unaccepted_table->version = 1;
+	unaccepted_table->unit_size = EFI_UNACCEPTED_UNIT_SIZE;
+	unaccepted_table->phys_base = unaccepted_start;
+	unaccepted_table->size = bitmap_size;
+	memset(unaccepted_table->bitmap, 0, bitmap_size);
+
+	status = efi_bs_call(install_configuration_table,
+			     &unaccepted_table_guid, unaccepted_table);
+	if (status != EFI_SUCCESS) {
+		efi_bs_call(free_pool, unaccepted_table);
+		efi_err("Failed to install unaccepted memory config table!\n");
+	}
+
+	return status;
+}
+
+/*
+ * The accepted memory bitmap only works at unit_size granularity.  Take
+ * unaligned start/end addresses and either:
+ *  1. Accepts the memory immediately and in its entirety
+ *  2. Accepts unaligned parts, and marks *some* aligned part unaccepted
+ *
+ * The function will never reach the bitmap_set() with zero bits to set.
+ */
+void process_unaccepted_memory(u64 start, u64 end)
+{
+	u64 unit_size = unaccepted_table->unit_size;
+	u64 unit_mask = unaccepted_table->unit_size - 1;
+	u64 bitmap_size = unaccepted_table->size;
+
+	/*
+	 * Ensure that at least one bit will be set in the bitmap by
+	 * immediately accepting all regions under 2*unit_size.  This is
+	 * imprecise and may immediately accept some areas that could
+	 * have been represented in the bitmap.  But, it results in simpler
+	 * code below.
+	 *
+	 * Consider case like this (assuming unit_size == 2MB):
+	 *
+	 * | 4k | 2044k |    2048k   |
+	 * ^ 0x0        ^ 2MB        ^ 4MB
+	 *
+	 * Only the first 4k has been accepted. The 0MB->2MB region can not be
+	 * represented in the bitmap. The 2MB->4MB region can be represented in
+	 * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
+	 * immediately accepted in its entirety.
+	 */
+	if (end - start < 2 * unit_size) {
+		arch_accept_memory(start, end);
+		return;
+	}
+
+	/*
+	 * No matter how the start and end are aligned, at least one unaccepted
+	 * unit_size area will remain to be marked in the bitmap.
+	 */
+
+	/* Immediately accept a <unit_size piece at the start: */
+	if (start & unit_mask) {
+		arch_accept_memory(start, round_up(start, unit_size));
+		start = round_up(start, unit_size);
+	}
+
+	/* Immediately accept a <unit_size piece at the end: */
+	if (end & unit_mask) {
+		arch_accept_memory(round_down(end, unit_size), end);
+		end = round_down(end, unit_size);
+	}
+
+	/*
+	 * Accept the part of the range that lies before phys_base and cannot
+	 * be recorded in the bitmap.
+	 */
+	if (start < unaccepted_table->phys_base) {
+		arch_accept_memory(start,
+				   min(unaccepted_table->phys_base, end));
+		start = unaccepted_table->phys_base;
+	}
+
+	/* Nothing to record */
+	if (end < unaccepted_table->phys_base)
+		return;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted_table->phys_base;
+	end -= unaccepted_table->phys_base;
+
+	/* Accept memory that doesn't fit into bitmap */
+	if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
+		unsigned long phys_start, phys_end;
+
+		phys_start = bitmap_size * unit_size * BITS_PER_BYTE +
+			     unaccepted_table->phys_base;
+		phys_end = end + unaccepted_table->phys_base;
+
+		arch_accept_memory(phys_start, phys_end);
+		end = bitmap_size * unit_size * BITS_PER_BYTE;
+	}
+
+	/*
+	 * 'start' and 'end' are now both unit_size-aligned.
+	 * Record the range as being unaccepted:
+	 */
+	bitmap_set(unaccepted_table->bitmap,
+		   start / unit_size, (end - start) / unit_size);
+}
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long range_start, range_end;
+	unsigned long bitmap_size;
+	u64 unit_size;
+
+	if (!unaccepted_table)
+		return;
+
+	unit_size = unaccepted_table->unit_size;
+
+	/*
+	 * Only care for the part of the range that is represented
+	 * in the bitmap.
+	 */
+	if (start < unaccepted_table->phys_base)
+		start = unaccepted_table->phys_base;
+	if (end < unaccepted_table->phys_base)
+		return;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted_table->phys_base;
+	end -= unaccepted_table->phys_base;
+
+	/* Make sure not to overrun the bitmap */
+	if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
+		end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
+
+	range_start = start / unit_size;
+	bitmap_size = DIV_ROUND_UP(end, unit_size);
+
+	for_each_set_bitrange_from(range_start, range_end,
+				   unaccepted_table->bitmap, bitmap_size) {
+		unsigned long phys_start, phys_end;
+
+		phys_start = range_start * unit_size + unaccepted_table->phys_base;
+		phys_end = range_end * unit_size + unaccepted_table->phys_base;
+
+		arch_accept_memory(phys_start, phys_end);
+		bitmap_clear(unaccepted_table->bitmap,
+			     range_start, range_end - range_start);
+	}
+}
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index fff81843169c..8d17cee8b98e 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -613,6 +613,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 			e820_type = E820_TYPE_PMEM;
 			break;
 
+		case EFI_UNACCEPTED_MEMORY:
+			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
+				efi_warn_once(
+"The system has unaccepted memory,  but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
+				continue;
+			}
+			e820_type = E820_TYPE_RAM;
+			process_unaccepted_memory(d->phys_addr,
+						  d->phys_addr + PAGE_SIZE * d->num_pages);
+			break;
 		default:
 			continue;
 		}
@@ -697,6 +707,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
 	}
 
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
+		status = allocate_unaccepted_bitmap(nr_desc, map);
+
 	efi_bs_call(free_pool, map);
 	return status;
 }
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 7aa62c92185f..29cc622910da 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
 #define EFI_PERSISTENT_MEMORY		14
-#define EFI_MAX_MEMORY_TYPE		15
+#define EFI_UNACCEPTED_MEMORY		15
+#define EFI_MAX_MEMORY_TYPE		16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
@@ -417,6 +418,7 @@ void efi_native_runtime_setup(void);
 #define LINUX_EFI_MOK_VARIABLE_TABLE_GUID	EFI_GUID(0xc451ed2b, 0x9694, 0x45d3,  0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
 #define LINUX_EFI_COCO_SECRET_AREA_GUID		EFI_GUID(0xadf956ad, 0xe98c, 0x484c,  0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
 #define LINUX_EFI_BOOT_MEMMAP_GUID		EFI_GUID(0x800f683f, 0xd08b, 0x423a,  0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
+#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID	EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9,  0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
 
 #define RISCV_EFI_BOOT_PROTOCOL_GUID		EFI_GUID(0xccd15fec, 0x6f73, 0x4eec,  0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)
 
@@ -534,6 +536,14 @@ struct efi_boot_memmap {
 	efi_memory_desc_t	map[];
 };
 
+struct efi_unaccepted_memory {
+	u32 version;
+	u32 unit_size;
+	u64 phys_base;
+	u64 size;
+	unsigned long bitmap[];
+};
+
 /*
  * Architecture independent structure for describing a memory map for the
  * benefit of efi_memmap_init_early(), and for passing context between
-- 
2.39.3


* [PATCHv11 4/9] x86/boot/compressed: Handle unaccepted memory
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2023-05-13 22:04 ` [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-16 17:09   ` Liam Merwick
  2023-05-17 15:52   ` Tom Lendacky
  2023-05-13 22:04 ` [PATCHv11 5/9] efi: Provide helpers for " Kirill A. Shutemov
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

The firmware will pre-accept the memory used to run the stub. But, the
stub is responsible for accepting the memory into which it decompresses
the main kernel. Accept memory just before decompression starts.

The stub is also responsible for choosing a physical address in which to
place the decompressed kernel image. The KASLR mechanism will randomize
this physical address. Since the accepted memory region is relatively
small, KASLR would be quite ineffective if it only used the pre-accepted
area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes among the
entire physical address space by also including EFI_UNACCEPTED_MEMORY.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/efi.h   |  1 +
 arch/x86/boot/compressed/kaslr.c | 35 +++++++++++++++++++++-----------
 arch/x86/boot/compressed/misc.c  |  6 ++++++
 arch/x86/boot/compressed/misc.h  |  6 ++++++
 4 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
index 7db2f41b54cd..cf475243b6d5 100644
--- a/arch/x86/boot/compressed/efi.h
+++ b/arch/x86/boot/compressed/efi.h
@@ -32,6 +32,7 @@ typedef	struct {
 } efi_table_hdr_t;
 
 #define EFI_CONVENTIONAL_MEMORY		 7
+#define EFI_UNACCEPTED_MEMORY		15
 
 #define EFI_MEMORY_MORE_RELIABLE \
 				((u64)0x0000000000010000ULL)	/* higher reliability */
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 454757fbdfe5..749f0fe7e446 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -672,6 +672,28 @@ static bool process_mem_region(struct mem_vector *region,
 }
 
 #ifdef CONFIG_EFI
+
+/*
+ * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
+ * guaranteed to be free.
+ *
+ * It is more conservative in picking free memory than the EFI spec allows:
+ *
+ * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
+ * and thus available to place the kernel image into, but in practice there's
+ * firmware where using that memory leads to crashes.
+ */
+static inline bool memory_type_is_free(efi_memory_desc_t *md)
+{
+	if (md->type == EFI_CONVENTIONAL_MEMORY)
+		return true;
+
+	if (md->type == EFI_UNACCEPTED_MEMORY)
+		return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
+
+	return false;
+}
+
 /*
  * Returns true if we processed the EFI memmap, which we prefer over the E820
  * table if it is available.
@@ -716,18 +738,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
 	for (i = 0; i < nr_desc; i++) {
 		md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
 
-		/*
-		 * Here we are more conservative in picking free memory than
-		 * the EFI spec allows:
-		 *
-		 * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
-		 * free memory and thus available to place the kernel image into,
-		 * but in practice there's firmware where using that memory leads
-		 * to crashes.
-		 *
-		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
-		 */
-		if (md->type != EFI_CONVENTIONAL_MEMORY)
+		if (!memory_type_is_free(md))
 			continue;
 
 		if (efi_soft_reserve_enabled() &&
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b..eb8df0d4ad51 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -455,6 +455,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
+
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
+		debug_putstr("Accepting memory... ");
+		accept_memory(__pa(output), __pa(output) + needed_size);
+	}
+
 	__decompress(input_data, input_len, NULL, NULL, output, output_len,
 			NULL, error);
 	entry_offset = parse_elf(output);
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 2f155a0e3041..9663d1839f54 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -247,4 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
 }
 #endif /* CONFIG_EFI */
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+void accept_memory(phys_addr_t start, phys_addr_t end);
+#else
+static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
+#endif
+
 #endif /* BOOT_COMPRESSED_MISC_H */
-- 
2.39.3


* [PATCHv11 5/9] efi: Provide helpers for unaccepted memory
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2023-05-13 22:04 ` [PATCHv11 4/9] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-16 12:06   ` [PATCHv11.1 5/9] efi: Add unaccepted memory support Kirill A. Shutemov
  2023-05-13 22:04 ` [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory Kirill A. Shutemov
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed.

 - range_contains_unaccepted_memory() checks if anything within the
   range requires acceptance.

Architectural code has to provide efi_get_unaccepted_table(), which
returns a pointer to the unaccepted memory configuration table.

arch_accept_memory() handles the arch-specific part of memory acceptance.
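
How the two helpers relate to the bitmap can be shown with a simplified
userspace model. Everything named toy_* below is invented for this sketch:
it uses a byte-addressed bitmap, takes no lock, rounds the start down to a
unit boundary, and leaves out the arch_accept_memory() call that the real
helper makes before clearing bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct efi_unaccepted_memory (illustration only) */
struct toy_table {
	uint64_t unit_size;	/* bytes covered by one bit */
	uint64_t phys_base;	/* physical address of bit 0 */
	uint64_t size;		/* bitmap size in bytes */
	uint8_t *bitmap;	/* set bit == unit still unaccepted */
};

/* Clamp [start, end) to the bitmap and translate to bitmap offsets */
static bool toy_clamp(const struct toy_table *t, uint64_t *start, uint64_t *end)
{
	uint64_t limit = t->size * t->unit_size * 8;

	assert(t->unit_size != 0);
	if (*end < t->phys_base)
		return false;
	if (*start < t->phys_base)
		*start = t->phys_base;

	*start -= t->phys_base;
	*end -= t->phys_base;
	if (*end > limit)
		*end = limit;
	return true;
}

static bool toy_range_contains_unaccepted(const struct toy_table *t,
					  uint64_t start, uint64_t end)
{
	uint64_t bit;

	if (!toy_clamp(t, &start, &end))
		return false;

	/* Visit every unit that overlaps the range */
	for (start -= start % t->unit_size; start < end; start += t->unit_size) {
		bit = start / t->unit_size;
		if (t->bitmap[bit / 8] & (1u << (bit % 8)))
			return true;
	}
	return false;
}

static void toy_accept_memory(struct toy_table *t, uint64_t start, uint64_t end)
{
	uint64_t bit, last;

	if (!toy_clamp(t, &start, &end))
		return;

	last = (end + t->unit_size - 1) / t->unit_size;
	/* The real helper would call arch_accept_memory() before clearing */
	for (bit = start / t->unit_size; bit < last; bit++)
		t->bitmap[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}
```

Once toy_accept_memory() has run over a range,
toy_range_contains_unaccepted() reports false for that range, which is
the invariant core-mm relies on.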

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/firmware/efi/Makefile            |   1 +
 drivers/firmware/efi/efi.c               |   6 ++
 drivers/firmware/efi/unaccepted_memory.c | 103 +++++++++++++++++++++++
 include/linux/efi.h                      |   1 +
 4 files changed, 111 insertions(+)
 create mode 100644 drivers/firmware/efi/unaccepted_memory.c

diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index b51f2a4c821e..e489fefd23da 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)	+= capsule-loader.o
 obj-$(CONFIG_EFI_EARLYCON)		+= earlycon.o
 obj-$(CONFIG_UEFI_CPER_ARM)		+= cper-arm.o
 obj-$(CONFIG_UEFI_CPER_X86)		+= cper-x86.o
+obj-$(CONFIG_UNACCEPTED_MEMORY)		+= unaccepted_memory.o
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 7dce06e419c5..e15a2005ed93 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
 #ifdef CONFIG_EFI_COCO_SECRET
 	.coco_secret		= EFI_INVALID_TABLE_ADDR,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	.unaccepted		= EFI_INVALID_TABLE_ADDR,
+#endif
 };
 EXPORT_SYMBOL(efi);
 
@@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
 #ifdef CONFIG_EFI_COCO_SECRET
 	{LINUX_EFI_COCO_SECRET_AREA_GUID,	&efi.coco_secret,	"CocoSecret"	},
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	{LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID,	&efi.unaccepted,	"Unaccepted"	},
+#endif
 #ifdef CONFIG_EFI_GENERIC_STUB
 	{LINUX_EFI_SCREEN_INFO_TABLE_GUID,	&screen_info_table			},
 #endif
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
new file mode 100644
index 000000000000..bb91c41f76fb
--- /dev/null
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <linux/memblock.h>
+#include <linux/spinlock.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long range_start, range_end;
+	unsigned long flags;
+	u64 unit_size;
+
+	if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
+		return;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return;
+
+	unit_size = unaccepted->unit_size;
+
+	/*
+	 * Only care for the part of the range that is represented
+	 * in the bitmap.
+	 */
+	if (start < unaccepted->phys_base)
+		start = unaccepted->phys_base;
+	if (end < unaccepted->phys_base)
+		return;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted->phys_base;
+	end -= unaccepted->phys_base;
+
+	/* Make sure not to overrun the bitmap */
+	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+		end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	range_start = start / unit_size;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
+				   DIV_ROUND_UP(end, unit_size)) {
+		unsigned long phys_start, phys_end;
+		unsigned long len = range_end - range_start;
+
+		phys_start = range_start * unit_size + unaccepted->phys_base;
+		phys_end = range_end * unit_size + unaccepted->phys_base;
+
+		arch_accept_memory(phys_start, phys_end);
+		bitmap_clear(unaccepted->bitmap, range_start, len);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long flags;
+	bool ret = false;
+	u64 unit_size;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return false;
+
+	unit_size = unaccepted->unit_size;
+
+	/*
+	 * Only care for the part of the range that is represented
+	 * in the bitmap.
+	 */
+	if (start < unaccepted->phys_base)
+		start = unaccepted->phys_base;
+	if (end < unaccepted->phys_base)
+		return false;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted->phys_base;
+	end -= unaccepted->phys_base;
+
+	/* Make sure not to overrun the bitmap */
+	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+		end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	while (start < end) {
+		if (test_bit(start / unit_size, unaccepted->bitmap)) {
+			ret = true;
+			break;
+		}
+
+		start += unit_size;
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return ret;
+}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 29cc622910da..9864f9c00da2 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -646,6 +646,7 @@ extern struct efi {
 	unsigned long			tpm_final_log;		/* TPM2 Final Events Log table */
 	unsigned long			mokvar_table;		/* MOK variable config table */
 	unsigned long			coco_secret;		/* Confidential computing secret table */
+	unsigned long			unaccepted;		/* Unaccepted memory table */
 
 	efi_get_time_t			*get_time;
 	efi_set_time_t			*set_time;
-- 
2.39.3


* [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2023-05-13 22:04 ` [PATCHv11 5/9] efi: Provide helpers for " Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-16 18:08   ` Ard Biesheuvel
  2023-05-17 16:07   ` Tom Lendacky
  2023-05-13 22:04 ` [PATCHv11 7/9] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.

But, this approach does not work for unaccepted memory. For TDX, a load
from unaccepted memory will not lead to a recoverable exception within
the guest. The guest will exit to the VMM where the only recourse is to
terminate the guest.

There are two parts to fix this issue and comprehensively avoid access
to unaccepted memory. Together these ensure that an extra "guard" page
is accepted in addition to the memory that needs to be used.

1. Implicitly extend the range_contains_unaccepted_memory(start, end)
   checks up to end+unit_size if 'end' is aligned on a unit_size
   boundary.
2. Implicitly extend accept_memory(start, end) to end+unit_size if 'end'
   is aligned on a unit_size boundary.
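
Both steps apply the same adjustment to 'end' after it has been translated
to a bitmap offset. A minimal standalone sketch (extend_for_guard() is a
name assumed for this example):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: if 'end' sits exactly on a unit boundary, extend
 * it by one unit so the unit right after the range is covered as well.
 * An unaligned 'end' is returned unchanged.
 */
static uint64_t extend_for_guard(uint64_t end, uint64_t unit_size)
{
	assert(unit_size != 0);

	if (!(end % unit_size))
		end += unit_size;

	return end;
}
```

When 'end' is not aligned, the unit containing it is already covered by
the round-up to whole units, so only the aligned case needs the extra
unit.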

Side note: This leads to something strange. Pages which were accepted
	   at boot, marked by the firmware as accepted and will never
	   _need_ to be accepted might be on the unaccepted_pages list.
	   This is a cue to ensure that the next page is accepted
	   before 'page' can be used.

This is an actual, real-world problem which was discovered during TDX
testing.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 drivers/firmware/efi/unaccepted_memory.c | 35 ++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index bb91c41f76fb..3d1ca60916dd 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -37,6 +37,34 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	start -= unaccepted->phys_base;
 	end -= unaccepted->phys_base;
 
+	/*
+	 * load_unaligned_zeropad() can lead to unwanted loads across page
+	 * boundaries. The unwanted loads are typically harmless. But, they
+	 * might be made to totally unrelated or even unmapped memory.
+	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+	 * #VE) to recover from these unwanted loads.
+	 *
+	 * But, this approach does not work for unaccepted memory. For TDX, a
+	 * load from unaccepted memory will not lead to a recoverable exception
+	 * within the guest. The guest will exit to the VMM where the only
+	 * recourse is to terminate the guest.
+	 *
+	 * There are two parts to fix this issue and comprehensively avoid
+	 * access to unaccepted memory. Together these ensure that an extra
+	 * "guard" page is accepted in addition to the memory that needs to be
+	 * used:
+	 *
+	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+	 *    checks up to end+unit_size if 'end' is aligned on a unit_size
+	 *    boundary.
+	 *
+	 * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
+	 *    'end' is aligned on a unit_size boundary. (immediately following
+	 *    this comment)
+	 */
+	if (!(end % unit_size))
+		end += unit_size;
+
 	/* Make sure not to overrun the bitmap */
 	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
 		end = unaccepted->size * unit_size * BITS_PER_BYTE;
@@ -84,6 +112,13 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 	start -= unaccepted->phys_base;
 	end -= unaccepted->phys_base;
 
+	/*
+	 * Also consider the unaccepted state of the *next* page. See fix #1 in
+	 * the comment on load_unaligned_zeropad() in accept_memory().
+	 */
+	if (!(end % unit_size))
+		end += unit_size;
+
 	/* Make sure not to overrun the bitmap */
 	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
 		end = unaccepted->size * unit_size * BITS_PER_BYTE;
-- 
2.39.3



* [PATCHv11 7/9] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2023-05-13 22:04 ` [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-13 22:04 ` [PATCHv11 8/9] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

Memory acceptance requires a hypercall and one or multiple module calls.

Make the helpers for these calls available in the boot stub, which has to
accept the memory where the kernel image and initrd are placed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c           | 32 -------------------
 arch/x86/include/asm/shared/tdx.h | 51 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/tdx.h        | 19 ------------
 3 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index e146b599260f..e6f4c2758a68 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -14,20 +14,6 @@
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
 
-/* TDX module Call Leaf IDs */
-#define TDX_GET_INFO			1
-#define TDX_GET_VEINFO			3
-#define TDX_GET_REPORT			4
-#define TDX_ACCEPT_PAGE			6
-#define TDX_WR				8
-
-/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
-#define TDCS_NOTIFY_ENABLES		0x9100000000000010
-
-/* TDX hypercall Leaf IDs */
-#define TDVMCALL_MAP_GPA		0x10001
-#define TDVMCALL_REPORT_FATAL_ERROR	0x10003
-
 /* MMIO direction */
 #define EPT_READ	0
 #define EPT_WRITE	1
@@ -51,24 +37,6 @@
 
 #define TDREPORT_SUBTYPE_0	0
 
-/*
- * Wrapper for standard use of __tdx_hypercall with no output aside from
- * return code.
- */
-static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
-{
-	struct tdx_hypercall_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = fn,
-		.r12 = r12,
-		.r13 = r13,
-		.r14 = r14,
-		.r15 = r15,
-	};
-
-	return __tdx_hypercall(&args);
-}
-
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __tdx_hypercall_failed(void)
 {
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 2631e01f6e0f..1ff0ee822961 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -10,6 +10,20 @@
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO			1
+#define TDX_GET_VEINFO			3
+#define TDX_GET_REPORT			4
+#define TDX_ACCEPT_PAGE			6
+#define TDX_WR				8
+
+/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
+#define TDCS_NOTIFY_ENABLES		0x9100000000000010
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA		0x10001
+#define TDVMCALL_REPORT_FATAL_ERROR	0x10003
+
 #ifndef __ASSEMBLY__
 
 /*
@@ -37,8 +51,45 @@ struct tdx_hypercall_args {
 u64 __tdx_hypercall(struct tdx_hypercall_args *args);
 u64 __tdx_hypercall_ret(struct tdx_hypercall_args *args);
 
+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = fn,
+		.r12 = r12,
+		.r13 = r13,
+		.r14 = r14,
+		.r15 = r15,
+	};
+
+	return __tdx_hypercall(&args);
+}
+
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void);
 
+/*
+ * Used in __tdx_module_call() to gather the output registers' values of the
+ * TDCALL instruction when requesting services from the TDX module. This is a
+ * software only structure and not part of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..234197ec17e4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,21 +20,6 @@
 
 #ifndef __ASSEMBLY__
 
-/*
- * Used to gather the output registers values of the TDCALL and SEAMCALL
- * instructions when requesting services from the TDX module.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_module_output {
-	u64 rcx;
-	u64 rdx;
-	u64 r8;
-	u64 r9;
-	u64 r10;
-	u64 r11;
-};
-
 /*
  * Used by the #VE exception handler to gather the #VE exception
  * info from the TDX module. This is a software only structure
@@ -55,10 +40,6 @@ struct ve_info {
 
 void __init tdx_early_init(void);
 
-/* Used to communicate with the TDX module */
-u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-		      struct tdx_module_output *out);
-
 void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
-- 
2.39.3



* [PATCHv11 8/9] x86/tdx: Refactor try_accept_one()
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2023-05-13 22:04 ` [PATCHv11 7/9] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-13 22:04 ` [PATCHv11 9/9] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
  2023-05-16 22:41 ` [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Tom Lendacky
  9 siblings, 0 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

Rework try_accept_one() to return the accepted size instead of modifying
'start' inside the helper. This makes 'start' an in-only argument and
streamlines the code on the caller side.
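
The resulting caller pattern can be sketched in isolation. This is a
simplified model under stated assumptions: try_accept_one() here takes a
byte size instead of enum pg_level and only counts calls instead of
issuing the TDX module call; accept_range(), accept_calls and the SZ_*
macros are illustrative names for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* local stand-in for the kernel type */

#define SZ_4K	(1ULL << 12)
#define SZ_2M	(1ULL << 21)
#define SZ_1G	(1ULL << 30)

static unsigned long accept_calls;	/* counts simulated module calls */

/* Return the size accepted at 'start', or 0 if this size does not fit. */
static uint64_t try_accept_one(phys_addr_t start, uint64_t len,
			       uint64_t accept_size)
{
	/* Sizes must be powers of two for the alignment mask below. */
	assert((accept_size & (accept_size - 1)) == 0);

	if (start & (accept_size - 1))
		return 0;

	if (len < accept_size)
		return 0;

	accept_calls++;		/* the real helper issues the module call here */
	return accept_size;
}

/* Caller side: 'start' is advanced only here, by the returned size. */
static int accept_range(phys_addr_t start, phys_addr_t end)
{
	while (start < end) {
		uint64_t len = end - start;
		uint64_t accept_size;

		/* Try larger accepts first, then fall back to smaller ones. */
		accept_size = try_accept_one(start, len, SZ_1G);
		if (!accept_size)
			accept_size = try_accept_one(start, len, SZ_2M);
		if (!accept_size)
			accept_size = try_accept_one(start, len, SZ_4K);
		if (!accept_size)
			return 0;
		start += accept_size;
	}

	return 1;
}
```

For the range 0x1ff000-0x401000 this model issues three accepts: one 4K
page, one 2M page, and one more 4K page.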

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index e6f4c2758a68..0d5fe6e24e45 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -713,18 +713,18 @@ static bool tdx_cache_flush_required(void)
 	return true;
 }
 
-static bool try_accept_one(phys_addr_t *start, unsigned long len,
-			  enum pg_level pg_level)
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+				    enum pg_level pg_level)
 {
 	unsigned long accept_size = page_level_size(pg_level);
 	u64 tdcall_rcx;
 	u8 page_size;
 
-	if (!IS_ALIGNED(*start, accept_size))
-		return false;
+	if (!IS_ALIGNED(start, accept_size))
+		return 0;
 
 	if (len < accept_size)
-		return false;
+		return 0;
 
 	/*
 	 * Pass the page physical address to the TDX module to accept the
@@ -743,15 +743,14 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
 		page_size = 2;
 		break;
 	default:
-		return false;
+		return 0;
 	}
 
-	tdcall_rcx = *start | page_size;
+	tdcall_rcx = start | page_size;
 	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
-		return false;
+		return 0;
 
-	*start += accept_size;
-	return true;
+	return accept_size;
 }
 
 /*
@@ -788,21 +787,22 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	 */
 	while (start < end) {
 		unsigned long len = end - start;
+		unsigned long accept_size;
 
 		/*
 		 * Try larger accepts first. It gives chance to VMM to keep
-		 * 1G/2M SEPT entries where possible and speeds up process by
-		 * cutting number of hypercalls (if successful).
+		 * 1G/2M Secure EPT entries where possible and speeds up
+		 * process by cutting number of hypercalls (if successful).
 		 */
 
-		if (try_accept_one(&start, len, PG_LEVEL_1G))
-			continue;
-
-		if (try_accept_one(&start, len, PG_LEVEL_2M))
-			continue;
-
-		if (!try_accept_one(&start, len, PG_LEVEL_4K))
+		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+		if (!accept_size)
 			return false;
+		start += accept_size;
 	}
 
 	return true;
-- 
2.39.3



* [PATCHv11 9/9] x86/tdx: Add unaccepted memory support
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2023-05-13 22:04 ` [PATCHv11 8/9] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
@ 2023-05-13 22:04 ` Kirill A. Shutemov
  2023-05-16 22:41 ` [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Tom Lendacky
  9 siblings, 0 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-13 22:04 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Kirill A. Shutemov

Hook up TDX-specific code to accept memory.

Accepting memory is the same process as converting memory from shared to
private: the kernel notifies the VMM with the MAP_GPA hypercall and then
accepts the pages with the ACCEPT_PAGE module call.

The implementation in the core kernel uses tdx_enc_status_changed(), which
is already used for converting memory to shared and back for I/O
transactions.

The boot stub provides its own implementation of tdx_accept_memory(). It
is similar in structure to tdx_enc_status_changed(), but only cares about
converting memory to private.
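
For reference, the ACCEPT_PAGE argument built by try_accept_one() in this
patch packs the page size into the low bits of RCX (0 - 4K, 1 - 2M,
2 - 1G) alongside the physical address. A minimal standalone sketch of
that encoding follows; accept_page_rcx() and enum pg_level_sketch are
hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative replacement for the kernel's enum pg_level. */
enum pg_level_sketch { LEVEL_4K, LEVEL_2M, LEVEL_1G };

/* Build the RCX value for ACCEPT_PAGE: physical address | page size code. */
static uint64_t accept_page_rcx(uint64_t gpa, enum pg_level_sketch level)
{
	uint64_t page_size;

	switch (level) {
	case LEVEL_4K:
		page_size = 0;
		break;
	case LEVEL_2M:
		page_size = 1;
		break;
	default:		/* LEVEL_1G */
		page_size = 2;
		break;
	}

	/* The address must be at least 4K aligned so the low bits are free. */
	assert((gpa & 0xfff) == 0);

	return gpa | page_size;
}
```

For example, accepting the 2M page at 0x200000 would pass 0x200001 in RCX
under this encoding.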

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                         |  2 +
 arch/x86/boot/compressed/Makefile        |  2 +-
 arch/x86/boot/compressed/error.c         | 19 +++++
 arch/x86/boot/compressed/error.h         |  1 +
 arch/x86/boot/compressed/mem.c           | 35 ++++++++-
 arch/x86/boot/compressed/tdx-shared.c    |  2 +
 arch/x86/boot/compressed/tdx.c           | 37 +++++++++
 arch/x86/coco/tdx/Makefile               |  2 +-
 arch/x86/coco/tdx/tdx-shared.c           | 95 ++++++++++++++++++++++++
 arch/x86/coco/tdx/tdx.c                  | 86 +--------------------
 arch/x86/include/asm/shared/tdx.h        |  2 +
 arch/x86/include/asm/tdx.h               |  2 +
 arch/x86/include/asm/unaccepted_memory.h | 23 ++++++
 13 files changed, 221 insertions(+), 87 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx-shared.c
 create mode 100644 arch/x86/coco/tdx/tdx-shared.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..5c72067c06d4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,9 +884,11 @@ config INTEL_TDX_GUEST
 	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
+	depends on EFI_STUB
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MEM_ENCRYPT
 	select X86_MCE
+	select UNACCEPTED_MEMORY
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index cc4978123c30..78f67e0a2666 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,7 +107,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
-vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o $(obj)/tdx-shared.o
 
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/error.c b/arch/x86/boot/compressed/error.c
index c881878e56d3..5313c5cb2b80 100644
--- a/arch/x86/boot/compressed/error.c
+++ b/arch/x86/boot/compressed/error.c
@@ -22,3 +22,22 @@ void error(char *m)
 	while (1)
 		asm("hlt");
 }
+
+/* EFI libstub  provides vsnprintf() */
+#ifdef CONFIG_EFI_STUB
+void panic(const char *fmt, ...)
+{
+	static char buf[1024];
+	va_list args;
+	int len;
+
+	va_start(args, fmt);
+	len = vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	if (len && buf[len - 1] == '\n')
+		buf[len - 1] = '\0';
+
+	error(buf);
+}
+#endif
diff --git a/arch/x86/boot/compressed/error.h b/arch/x86/boot/compressed/error.h
index 1de5821184f1..86fe33b93715 100644
--- a/arch/x86/boot/compressed/error.h
+++ b/arch/x86/boot/compressed/error.h
@@ -6,5 +6,6 @@
 
 void warn(char *m);
 void error(char *m) __noreturn;
+void panic(const char *fmt, ...) __noreturn __cold;
 
 #endif /* BOOT_COMPRESSED_ERROR_H */
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 67594fcb11d9..a4308d077885 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -1,9 +1,42 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
+#include "../cpuflags.h"
+#include "../string.h"
 #include "error.h"
+#include "tdx.h"
+#include <asm/shared/tdx.h>
+
+/*
+ * accept_memory() and process_unaccepted_memory() are called from the EFI
+ * stub, which runs before the decompressor and its early_tdx_detect().
+ *
+ * Enumerate TDX directly from the early users.
+ */
+static bool early_is_tdx_guest(void)
+{
+	static bool once;
+	static bool is_tdx;
+
+	if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
+		return false;
+
+	if (!once) {
+		u32 eax, sig[3];
+
+		cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
+			    &sig[0], &sig[2],  &sig[1]);
+		is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
+		once = true;
+	}
+
+	return is_tdx;
+}
 
 void arch_accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
-	error("Cannot accept memory");
+	if (early_is_tdx_guest())
+		tdx_accept_memory(start, end);
+	else
+		error("Cannot accept memory: unknown platform\n");
 }
diff --git a/arch/x86/boot/compressed/tdx-shared.c b/arch/x86/boot/compressed/tdx-shared.c
new file mode 100644
index 000000000000..5ac43762fe13
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx-shared.c
@@ -0,0 +1,2 @@
+#include "error.h"
+#include "../../coco/tdx/tdx-shared.c"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 2d81d3cc72a1..d073764eaa50 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -9,6 +9,9 @@
 #include <uapi/asm/vmx.h>
 
 #include <asm/shared/tdx.h>
+#include <asm/page_types.h>
+
+static u64 cc_mask;
 
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void)
@@ -16,6 +19,38 @@ void __tdx_hypercall_failed(void)
 	error("TDVMCALL failed. TDX module bug?");
 }
 
+static u64 get_cc_mask(void)
+{
+	struct tdx_module_output out;
+	unsigned int gpa_width;
+
+	/*
+	 * TDINFO TDX module call is used to get the TD execution environment
+	 * information like GPA width, number of available vcpus, debug mode
+	 * information, etc. More details about the ABI can be found in TDX
+	 * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+	 * [TDG.VP.INFO].
+	 *
+	 * The GPA width that comes out of this call is critical. TDX guests
+	 * can not meaningfully run without it.
+	 */
+	if (__tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out))
+		error("TDCALL GET_INFO failed (Buggy TDX module!)\n");
+
+	gpa_width = out.rcx & GENMASK(5, 0);
+
+	/*
+	 * The highest bit of a guest physical address is the "sharing" bit.
+	 * Set it for shared pages and clear it for private pages.
+	 */
+	return BIT_ULL(gpa_width - 1);
+}
+
+u64 cc_mkdec(u64 val)
+{
+	return val & ~cc_mask;
+}
+
 static inline unsigned int tdx_io_in(int size, u16 port)
 {
 	struct tdx_hypercall_args args = {
@@ -70,6 +105,8 @@ void early_tdx_detect(void)
 	if (memcmp(TDX_IDENT, sig, sizeof(sig)))
 		return;
 
+	cc_mask = get_cc_mask();
+
 	/* Use hypercalls instead of I/O instructions */
 	pio_ops.f_inb  = tdx_inb;
 	pio_ops.f_outb = tdx_outb;
diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..2c7dcbf1458b 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdx-shared.o tdcall.o
diff --git a/arch/x86/coco/tdx/tdx-shared.c b/arch/x86/coco/tdx/tdx-shared.c
new file mode 100644
index 000000000000..ee74f7bbe806
--- /dev/null
+++ b/arch/x86/coco/tdx/tdx-shared.c
@@ -0,0 +1,95 @@
+#include <asm/tdx.h>
+#include <asm/pgtable.h>
+
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+				    enum pg_level pg_level)
+{
+	unsigned long accept_size = page_level_size(pg_level);
+	u64 tdcall_rcx;
+	u8 page_size;
+
+	if (!IS_ALIGNED(start, accept_size))
+		return 0;
+
+	if (len < accept_size)
+		return 0;
+
+	/*
+	 * Pass the page physical address to the TDX module to accept the
+	 * pending, private page.
+	 *
+	 * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+	 */
+	switch (pg_level) {
+	case PG_LEVEL_4K:
+		page_size = 0;
+		break;
+	case PG_LEVEL_2M:
+		page_size = 1;
+		break;
+	case PG_LEVEL_1G:
+		page_size = 2;
+		break;
+	default:
+		return 0;
+	}
+
+	tdcall_rcx = start | page_size;
+	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
+		return 0;
+
+	return accept_size;
+}
+
+bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end, bool enc)
+{
+	if (!enc) {
+		/* Set the shared (decrypted) bits: */
+		start |= cc_mkdec(0);
+		end   |= cc_mkdec(0);
+	}
+
+	/*
+	 * Notify the VMM about page mapping conversion. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
+	 * section "TDG.VP.VMCALL<MapGPA>"
+	 */
+	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+		return false;
+
+	/* private->shared conversion  requires only MapGPA call */
+	if (!enc)
+		return true;
+
+	/*
+	 * For shared->private conversion, accept the page using
+	 * TDX_ACCEPT_PAGE TDX module call.
+	 */
+	while (start < end) {
+		unsigned long len = end - start;
+		unsigned long accept_size;
+
+		/*
+		 * Try larger accepts first. It gives chance to VMM to keep
+		 * 1G/2M Secure EPT entries where possible and speeds up
+		 * process by cutting number of hypercalls (if successful).
+		 */
+
+		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+		if (!accept_size)
+			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+		if (!accept_size)
+			return false;
+		start += accept_size;
+	}
+
+	return true;
+}
+
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	if (!tdx_enc_status_changed_phys(start, end, true))
+		panic("Accepting memory failed: %#llx-%#llx\n",  start, end);
+}
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 0d5fe6e24e45..32501277ef84 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -713,46 +713,6 @@ static bool tdx_cache_flush_required(void)
 	return true;
 }
 
-static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
-				    enum pg_level pg_level)
-{
-	unsigned long accept_size = page_level_size(pg_level);
-	u64 tdcall_rcx;
-	u8 page_size;
-
-	if (!IS_ALIGNED(start, accept_size))
-		return 0;
-
-	if (len < accept_size)
-		return 0;
-
-	/*
-	 * Pass the page physical address to the TDX module to accept the
-	 * pending, private page.
-	 *
-	 * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
-	 */
-	switch (pg_level) {
-	case PG_LEVEL_4K:
-		page_size = 0;
-		break;
-	case PG_LEVEL_2M:
-		page_size = 1;
-		break;
-	case PG_LEVEL_1G:
-		page_size = 2;
-		break;
-	default:
-		return 0;
-	}
-
-	tdcall_rcx = start | page_size;
-	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
-		return 0;
-
-	return accept_size;
-}
-
 /*
  * Inform the VMM of the guest's intent for this physical page: shared with
  * the VMM or private to the guest.  The VMM is expected to change its mapping
@@ -761,51 +721,9 @@ static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
 static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 {
 	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
 
-	if (!enc) {
-		/* Set the shared (decrypted) bits: */
-		start |= cc_mkdec(0);
-		end   |= cc_mkdec(0);
-	}
-
-	/*
-	 * Notify the VMM about page mapping conversion. More info about ABI
-	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
-	 * section "TDG.VP.VMCALL<MapGPA>"
-	 */
-	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
-		return false;
-
-	/* private->shared conversion  requires only MapGPA call */
-	if (!enc)
-		return true;
-
-	/*
-	 * For shared->private conversion, accept the page using
-	 * TDX_ACCEPT_PAGE TDX module call.
-	 */
-	while (start < end) {
-		unsigned long len = end - start;
-		unsigned long accept_size;
-
-		/*
-		 * Try larger accepts first. It gives chance to VMM to keep
-		 * 1G/2M Secure EPT entries where possible and speeds up
-		 * process by cutting number of hypercalls (if successful).
-		 */
-
-		accept_size = try_accept_one(start, len, PG_LEVEL_1G);
-		if (!accept_size)
-			accept_size = try_accept_one(start, len, PG_LEVEL_2M);
-		if (!accept_size)
-			accept_size = try_accept_one(start, len, PG_LEVEL_4K);
-		if (!accept_size)
-			return false;
-		start += accept_size;
-	}
-
-	return true;
+	return tdx_enc_status_changed_phys(start, end, enc);
 }
 
 void __init tdx_early_init(void)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 1ff0ee822961..95fbe7376694 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -91,5 +91,7 @@ struct tdx_module_output {
 u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		      struct tdx_module_output *out);
 
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 234197ec17e4..3a7340ad9a4b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -50,6 +50,8 @@ bool tdx_early_handle_ve(struct pt_regs *regs);
 
 int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);
 
+bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end, bool enc);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..72b354f992bb
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,23 @@
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+#include <linux/efi.h>
+#include <asm/tdx.h>
+
+static inline void arch_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-acceptance call goes here */
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		tdx_accept_memory(start, end);
+	} else {
+		panic("Cannot accept memory: unknown platform\n");
+	}
+}
+
+static inline struct efi_unaccepted_memory *efi_get_unaccepted_table(void)
+{
+	if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
+		return NULL;
+	return __va(efi.unaccepted);
+}
+#endif
-- 
2.39.3



* Re: [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory
  2023-05-13 22:04 ` [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory Kirill A. Shutemov
@ 2023-05-14  5:08   ` Mika Penttilä
  2023-05-14 21:13     ` Kirill A. Shutemov
  2023-05-16 18:06   ` Ard Biesheuvel
  1 sibling, 1 reply; 38+ messages in thread
From: Mika Penttilä @ 2023-05-14  5:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

Hi,


On 14.5.2023 1.04, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific for the Virtual
> Machine platform.
> 
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
> 
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
> 
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 (or whatever the arch uses) has to be modified on every
> page acceptance. It leads to table fragmentation and there's a limited
> number of entries in the e820 table.
> 
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents a
> naturally aligned power-2-sized region of address space -- unit.
> 
> For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB of
> physical address space.
> 
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- it needs 256MiB to handle 4PiB of the address
> space.
> 
> Any unaccepted memory that is not aligned to unit_size gets accepted
> upfront.
> 
> The bitmap is allocated and constructed in the EFI stub and passed down
> to the kernel via EFI configuration table. allocate_e820() allocates the
> bitmap if unaccepted memory is present, according to the size of
> unaccepted region.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   arch/x86/boot/compressed/Makefile             |   1 +
>   arch/x86/boot/compressed/mem.c                |   9 +
>   arch/x86/include/asm/efi.h                    |   2 +
>   drivers/firmware/efi/Kconfig                  |  14 ++
>   drivers/firmware/efi/efi.c                    |   1 +
>   drivers/firmware/efi/libstub/Makefile         |   2 +
>   drivers/firmware/efi/libstub/bitmap.c         |  41 ++++
>   drivers/firmware/efi/libstub/efistub.h        |   6 +
>   drivers/firmware/efi/libstub/find.c           |  43 ++++
>   .../firmware/efi/libstub/unaccepted_memory.c  | 222 ++++++++++++++++++
>   drivers/firmware/efi/libstub/x86-stub.c       |  13 +
>   include/linux/efi.h                           |  12 +-
>   12 files changed, 365 insertions(+), 1 deletion(-)
>   create mode 100644 arch/x86/boot/compressed/mem.c
>   create mode 100644 drivers/firmware/efi/libstub/bitmap.c
>   create mode 100644 drivers/firmware/efi/libstub/find.c
>   create mode 100644 drivers/firmware/efi/libstub/unaccepted_memory.c
> 
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 6b6cfe607bdb..cc4978123c30 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -107,6 +107,7 @@ endif
>   
>   vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>   vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
>   
>   vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>   vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
> new file mode 100644
> index 000000000000..67594fcb11d9
> --- /dev/null
> +++ b/arch/x86/boot/compressed/mem.c
> @@ -0,0 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "error.h"
> +
> +void arch_accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	/* Platform-specific memory-acceptance call goes here */
> +	error("Cannot accept memory");
> +}
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index 419280d263d2..8b4be7cecdb8 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -31,6 +31,8 @@ extern unsigned long efi_mixed_mode_stack_pa;
>   
>   #define ARCH_EFI_IRQ_FLAGS_MASK	X86_EFLAGS_IF
>   
> +#define EFI_UNACCEPTED_UNIT_SIZE PMD_SIZE
> +
>   /*
>    * The EFI services are called through variadic functions in many cases. These
>    * functions are implemented in assembler and support only a fixed number of
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 043ca31c114e..231f1c70d1db 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -269,6 +269,20 @@ config EFI_COCO_SECRET
>   	  virt/coco/efi_secret module to access the secrets, which in turn
>   	  allows userspace programs to access the injected secrets.
>   
> +config UNACCEPTED_MEMORY
> +	bool
> +	depends on EFI_STUB
> +	help
> +	   Some Virtual Machine platforms, such as Intel TDX, require
> +	   some memory to be "accepted" by the guest before it can be used.
> +	   This mechanism helps prevent malicious hosts from making changes
> +	   to guest memory.
> +
> +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> +	   This option adds support for unaccepted memory and makes such memory
> +	   usable by the kernel.
> +
>   config EFI_EMBEDDED_FIRMWARE
>   	bool
>   	select CRYPTO_LIB_SHA256
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index abeff7dc0b58..7dce06e419c5 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -843,6 +843,7 @@ static __initdata char memory_type_name[][13] = {
>   	"MMIO Port",
>   	"PAL Code",
>   	"Persistent",
> +	"Unaccepted",
>   };
>   
>   char * __init efi_md_typeattr_format(char *buf, size_t size,
> diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
> index 3abb2b357482..16d64a34d1e1 100644
> --- a/drivers/firmware/efi/libstub/Makefile
> +++ b/drivers/firmware/efi/libstub/Makefile
> @@ -96,6 +96,8 @@ CFLAGS_arm32-stub.o		:= -DTEXT_OFFSET=$(TEXT_OFFSET)
>   zboot-obj-$(CONFIG_RISCV)	:= lib-clz_ctz.o lib-ashldi3.o
>   lib-$(CONFIG_EFI_ZBOOT)		+= zboot.o $(zboot-obj-y)
>   
> +lib-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o bitmap.o find.o
> +
>   extra-y				:= $(lib-y)
>   lib-y				:= $(patsubst %.o,%.stub.o,$(lib-y))
>   
> diff --git a/drivers/firmware/efi/libstub/bitmap.c b/drivers/firmware/efi/libstub/bitmap.c
> new file mode 100644
> index 000000000000..5c9bba0d549b
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/bitmap.c
> @@ -0,0 +1,41 @@
> +#include <linux/bitmap.h>
> +
> +void __bitmap_set(unsigned long *map, unsigned int start, int len)
> +{
> +	unsigned long *p = map + BIT_WORD(start);
> +	const unsigned int size = start + len;
> +	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
> +	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
> +
> +	while (len - bits_to_set >= 0) {
> +		*p |= mask_to_set;
> +		len -= bits_to_set;
> +		bits_to_set = BITS_PER_LONG;
> +		mask_to_set = ~0UL;
> +		p++;
> +	}
> +	if (len) {
> +		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
> +		*p |= mask_to_set;
> +	}
> +}
> +
> +void __bitmap_clear(unsigned long *map, unsigned int start, int len)
> +{
> +	unsigned long *p = map + BIT_WORD(start);
> +	const unsigned int size = start + len;
> +	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
> +	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
> +
> +	while (len - bits_to_clear >= 0) {
> +		*p &= ~mask_to_clear;
> +		len -= bits_to_clear;
> +		bits_to_clear = BITS_PER_LONG;
> +		mask_to_clear = ~0UL;
> +		p++;
> +	}
> +	if (len) {
> +		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
> +		*p &= ~mask_to_clear;
> +	}
> +}
> diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
> index 67d5a20802e0..8659a01664b8 100644
> --- a/drivers/firmware/efi/libstub/efistub.h
> +++ b/drivers/firmware/efi/libstub/efistub.h
> @@ -1133,4 +1133,10 @@ const u8 *__efi_get_smbios_string(const struct efi_smbios_record *record,
>   void efi_remap_image(unsigned long image_base, unsigned alloc_size,
>   		     unsigned long code_size);
>   
> +efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> +					struct efi_boot_memmap *map);
> +void process_unaccepted_memory(u64 start, u64 end);
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +void arch_accept_memory(phys_addr_t start, phys_addr_t end);
> +
>   #endif
> diff --git a/drivers/firmware/efi/libstub/find.c b/drivers/firmware/efi/libstub/find.c
> new file mode 100644
> index 000000000000..4e7740d28987
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/find.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/bitmap.h>
> +#include <linux/math.h>
> +#include <linux/minmax.h>
> +
> +/*
> + * Common helper for find_next_bit() function family
> + * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
> + * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
> + * @size: The bitmap size in bits
> + * @start: The bitnumber to start searching at
> + */
> +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)				\
> +({										\
> +	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
> +										\
> +	if (unlikely(__start >= sz))						\
> +		goto out;							\
> +										\
> +	mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));				\
> +	idx = __start / BITS_PER_LONG;						\
> +										\
> +	for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {			\
> +		if ((idx + 1) * BITS_PER_LONG >= sz)				\
> +			goto out;						\
> +		idx++;								\
> +	}									\
> +										\
> +	sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);			\
> +out:										\
> +	sz;									\
> +})
> +
> +unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
> +{
> +	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
> +}
> +
> +unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
> +					 unsigned long start)
> +{
> +	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
> +}
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> new file mode 100644
> index 000000000000..f4642c4f25dd
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -0,0 +1,222 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <asm/efi.h>
> +#include "efistub.h"
> +
> +static struct efi_unaccepted_memory *unaccepted_table;
> +
> +efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> +					struct efi_boot_memmap *map)
> +{
> +	efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
> +	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
> +	efi_status_t status;
> +	int i;
> +
> +	/* Check if the table is already installed */
> +	unaccepted_table = get_efi_config_table(unaccepted_table_guid);
> +	if (unaccepted_table) {
> +		if (unaccepted_table->version != 1) {
> +			efi_err("Unknown version of unaccepted memory table\n");
> +			return EFI_UNSUPPORTED;
> +		}
> +		return EFI_SUCCESS;
> +	}
> +
> +	/* Check if there's any unaccepted memory and find the max address */
> +	for (i = 0; i < nr_desc; i++) {
> +		efi_memory_desc_t *d;
> +		unsigned long m = (unsigned long)map->map;
> +
> +		d = efi_early_memdesc_ptr(m, map->desc_size, i);
> +		if (d->type != EFI_UNACCEPTED_MEMORY)
> +			continue;
> +
> +		unaccepted_start = min(unaccepted_start, d->phys_addr);
> +		unaccepted_end = max(unaccepted_end,
> +				     d->phys_addr + d->num_pages * PAGE_SIZE);
> +	}
> +
> +	if (unaccepted_start == ULLONG_MAX)
> +		return EFI_SUCCESS;
> +
> +	unaccepted_start = round_down(unaccepted_start,
> +				      EFI_UNACCEPTED_UNIT_SIZE);
> +	unaccepted_end = round_up(unaccepted_end, EFI_UNACCEPTED_UNIT_SIZE);
> +
> +	/*
> +	 * If unaccepted memory is present, allocate a bitmap to track what
> +	 * memory has to be accepted before access.
> +	 *
> +	 * One bit in the bitmap represents 2MiB in the address space:
> +	 * A 4k bitmap can track 64GiB of physical address space.
> +	 *
> +	 * In the worst case scenario -- a huge hole in the middle of the
> +	 * address space -- It needs 256MiB to handle 4PiB of the address
> +	 * space.
> +	 *
> +	 * The bitmap will be populated in setup_e820() according to the memory
> +	 * map after efi_exit_boot_services().
> +	 */
> +	bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
> +				   EFI_UNACCEPTED_UNIT_SIZE * BITS_PER_BYTE);
> +
> +	status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
> +			     sizeof(*unaccepted_table) + bitmap_size,
> +			     (void **)&unaccepted_table);


I wonder whether EFI_LOADER_DATA guarantees that the bitmap is not freed,
or whether a more persistent type should be used. If EFI_LOADER_DATA is
enough, maybe a comment explaining why it is safe could be added.


> +	if (status != EFI_SUCCESS) {
> +		efi_err("Failed to allocate unaccepted memory config table\n");
> +		return status;
> +	}
> +
> +	unaccepted_table->version = 1;
> +	unaccepted_table->unit_size = EFI_UNACCEPTED_UNIT_SIZE;
> +	unaccepted_table->phys_base = unaccepted_start;
> +	unaccepted_table->size = bitmap_size;
> +	memset(unaccepted_table->bitmap, 0, bitmap_size);
> +
> +	status = efi_bs_call(install_configuration_table,
> +			     &unaccepted_table_guid, unaccepted_table);
> +	if (status != EFI_SUCCESS) {
> +		efi_bs_call(free_pool, unaccepted_table);
> +		efi_err("Failed to install unaccepted memory config table!\n");
> +	}
> +
> +	return status;
> +}
> +
> +/*
> + * The accepted memory bitmap only works at unit_size granularity.  Take
> + * unaligned start/end addresses and either:
> + *  1. Accepts the memory immediately and in its entirety
> + *  2. Accepts unaligned parts, and marks *some* aligned part unaccepted
> + *
> + * The function will never reach the bitmap_set() with zero bits to set.
> + */
> +void process_unaccepted_memory(u64 start, u64 end)
> +{
> +	u64 unit_size = unaccepted_table->unit_size;
> +	u64 unit_mask = unaccepted_table->unit_size - 1;
> +	u64 bitmap_size = unaccepted_table->size;
> +
> +	/*
> +	 * Ensure that at least one bit will be set in the bitmap by
> +	 * immediately accepting all regions under 2*unit_size.  This is
> +	 * imprecise and may immediately accept some areas that could
> +	 * have been represented in the bitmap.  But, results in simpler
> +	 * code below
> +	 *
> +	 * Consider case like this (assuming unit_size == 2MB):
> +	 *
> +	 * | 4k | 2044k |    2048k   |
> +	 * ^ 0x0        ^ 2MB        ^ 4MB
> +	 *
> +	 * Only the first 4k has been accepted. The 0MB->2MB region can not be
> +	 * represented in the bitmap. The 2MB->4MB region can be represented in
> +	 * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
> +	 * immediately accepted in its entirety.
> +	 */
> +	if (end - start < 2 * unit_size) {
> +		arch_accept_memory(start, end);
> +		return;
> +	}
> +
> +	/*
> +	 * No matter how the start and end are aligned, at least one unaccepted
> +	 * unit_size area will remain to be marked in the bitmap.
> +	 */
> +
> +	/* Immediately accept a <unit_size piece at the start: */
> +	if (start & unit_mask) {
> +		arch_accept_memory(start, round_up(start, unit_size));
> +		start = round_up(start, unit_size);
> +	}
> +
> +	/* Immediately accept a <unit_size piece at the end: */
> +	if (end & unit_mask) {
> +		arch_accept_memory(round_down(end, unit_size), end);
> +		end = round_down(end, unit_size);
> +	}
> +
> +	/*
> +	 * Accept the part of the range that lies before phys_base and cannot be
> +	 * recorded in the bitmap.
> +	 */
> +	if (start < unaccepted_table->phys_base) {
> +		arch_accept_memory(start,
> +				   min(unaccepted_table->phys_base, end));
> +		start = unaccepted_table->phys_base;
> +	}
> +
> +	/* Nothing to record */
> +	if (end < unaccepted_table->phys_base)
> +		return;
> +
> +	/* Translate to offsets from the beginning of the bitmap */
> +	start -= unaccepted_table->phys_base;
> +	end -= unaccepted_table->phys_base;
> +
> +	/* Accept memory that doesn't fit into bitmap */
> +	if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
> +		unsigned long phys_start, phys_end;
> +
> +		phys_start = bitmap_size * unit_size * BITS_PER_BYTE +
> +			     unaccepted_table->phys_base;
> +		phys_end = end + unaccepted_table->phys_base;
> +
> +		arch_accept_memory(phys_start, phys_end);
> +		end = bitmap_size * unit_size * BITS_PER_BYTE;
> +	}
> +
> +	/*
> +	 * 'start' and 'end' are now both unit_size-aligned.
> +	 * Record the range as being unaccepted:
> +	 */
> +	bitmap_set(unaccepted_table->bitmap,
> +		   start / unit_size, (end - start) / unit_size);
> +}
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned long range_start, range_end;
> +	unsigned long bitmap_size;
> +	u64 unit_size;
> +
> +	if (!unaccepted_table)
> +		return;
> +
> +	unit_size = unaccepted_table->unit_size;
> +
> +	/*
> +	 * Only care for the part of the range that is represented
> +	 * in the bitmap.
> +	 */
> +	if (start < unaccepted_table->phys_base)
> +		start = unaccepted_table->phys_base;
> +	if (end < unaccepted_table->phys_base)
> +		return;
> +
> +	/* Translate to offsets from the beginning of the bitmap */
> +	start -= unaccepted_table->phys_base;
> +	end -= unaccepted_table->phys_base;
> +
> +	/* Make sure not to overrun the bitmap */
> +	if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
> +		end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
> +
> +	range_start = start / unit_size;
> +	bitmap_size = DIV_ROUND_UP(end, unit_size);
> +
> +	for_each_set_bitrange_from(range_start, range_end,
> +				   unaccepted_table->bitmap, bitmap_size) {
> +		unsigned long phys_start, phys_end;
> +
> +		phys_start = range_start * unit_size + unaccepted_table->phys_base;
> +		phys_end = range_end * unit_size + unaccepted_table->phys_base;
> +
> +		arch_accept_memory(phys_start, phys_end);
> +		bitmap_clear(unaccepted_table->bitmap,
> +			     range_start, range_end - range_start);
> +	}
> +}
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index fff81843169c..8d17cee8b98e 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -613,6 +613,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>   			e820_type = E820_TYPE_PMEM;
>   			break;
>   
> +		case EFI_UNACCEPTED_MEMORY:
> +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
> +				efi_warn_once(
> +"The system has unaccepted memory,  but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
> +				continue;
> +			}
> +			e820_type = E820_TYPE_RAM;
> +			process_unaccepted_memory(d->phys_addr,
> +						  d->phys_addr + PAGE_SIZE * d->num_pages);
> +			break;
>   		default:
>   			continue;
>   		}
> @@ -697,6 +707,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
>   		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
>   	}
>   
> +	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
> +		status = allocate_unaccepted_bitmap(nr_desc, map);
> +
>   	efi_bs_call(free_pool, map);
>   	return status;
>   }
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 7aa62c92185f..29cc622910da 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -108,7 +108,8 @@ typedef	struct {
>   #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
>   #define EFI_PAL_CODE			13
>   #define EFI_PERSISTENT_MEMORY		14
> -#define EFI_MAX_MEMORY_TYPE		15
> +#define EFI_UNACCEPTED_MEMORY		15
> +#define EFI_MAX_MEMORY_TYPE		16
>   
>   /* Attribute values: */
>   #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
> @@ -417,6 +418,7 @@ void efi_native_runtime_setup(void);
>   #define LINUX_EFI_MOK_VARIABLE_TABLE_GUID	EFI_GUID(0xc451ed2b, 0x9694, 0x45d3,  0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
>   #define LINUX_EFI_COCO_SECRET_AREA_GUID		EFI_GUID(0xadf956ad, 0xe98c, 0x484c,  0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
>   #define LINUX_EFI_BOOT_MEMMAP_GUID		EFI_GUID(0x800f683f, 0xd08b, 0x423a,  0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
> +#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID	EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9,  0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
>   
>   #define RISCV_EFI_BOOT_PROTOCOL_GUID		EFI_GUID(0xccd15fec, 0x6f73, 0x4eec,  0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)
>   
> @@ -534,6 +536,14 @@ struct efi_boot_memmap {
>   	efi_memory_desc_t	map[];
>   };
>   
> +struct efi_unaccepted_memory {
> +	u32 version;
> +	u32 unit_size;
> +	u64 phys_base;
> +	u64 size;
> +	unsigned long bitmap[];
> +};
> +
>   /*
>    * Architecture independent structure for describing a memory map for the
>    * benefit of efi_memmap_init_early(), and for passing context between
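

For illustration, the trimming rule in process_unaccepted_memory() quoted
above can be sketched as standalone userspace C. This is only a sketch
under stated assumptions: split_range() is a hypothetical helper, not a
function from the patch; it leaves out the phys_base and end-of-bitmap
clamping; and it assumes unit_size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): decide what part of
 * [start, end) would be left for the bitmap.
 *
 * Returns false when the whole range is shorter than 2 * unit_size and
 * would therefore be accepted immediately. Otherwise returns true and
 * stores the unit_size-aligned interior in [*bm_start, *bm_end); the
 * unaligned head and tail would be accepted right away.
 */
static bool split_range(uint64_t start, uint64_t end, uint64_t unit_size,
			uint64_t *bm_start, uint64_t *bm_end)
{
	uint64_t unit_mask = unit_size - 1;

	/* Assumption of this sketch: unit_size is a non-zero power of two */
	assert(unit_size != 0 && (unit_size & unit_mask) == 0);
	assert(end >= start);

	if (end - start < 2 * unit_size)
		return false;

	*bm_start = (start + unit_mask) & ~unit_mask;	/* round_up() */
	*bm_end = end & ~unit_mask;			/* round_down() */
	return true;
}
```

With unit_size = 2MiB, the 4k..4MiB range from the comment in the patch is
shorter than 2 * unit_size, so split_range() reports that nothing is left
for the bitmap. A 1MiB..5MiB range leaves the single unit 2MiB..4MiB.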


* Re: [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory
  2023-05-14  5:08   ` Mika Penttilä
@ 2023-05-14 21:13     ` Kirill A. Shutemov
  2023-05-16 18:01       ` Ard Biesheuvel
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-14 21:13 UTC (permalink / raw)
  To: Mika Penttilä, Ard Biesheuvel
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Sun, May 14, 2023 at 08:08:07AM +0300, Mika Penttilä wrote:
> > +	status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
> > +			     sizeof(*unaccepted_table) + bitmap_size,
> > +			     (void **)&unaccepted_table);
> 
> 
> I wonder whether EFI_LOADER_DATA guarantees that the bitmap is not freed,
> or whether a more persistent type should be used. If EFI_LOADER_DATA is
> enough, maybe a comment explaining why it is safe could be added.

Ughh.. I lost the hunk that reserves the memory explicitly while
folding in the patch we discussed with Ard. See below.

But the question is a valid one.

Ard, do we want to allocate the memory as EFI_RUNTIME_SERVICES_DATA (or
something else?) that gets reserved automatically without additional steps?

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index e15a2005ed93..d817e7afd266 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -765,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
 		}
 	}
 
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+	    efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
+		struct efi_unaccepted_memory *unaccepted;
+
+		unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
+		if (unaccepted) {
+			unsigned long size;
+
+			if (unaccepted->version == 1) {
+				size = sizeof(*unaccepted) + unaccepted->size;
+				memblock_reserve(efi.unaccepted, size);
+			} else {
+				efi.unaccepted = EFI_INVALID_TABLE_ADDR;
+			}
+
+			early_memunmap(unaccepted, sizeof(*unaccepted));
+		}
+	}
+
 	return 0;
 }
 
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* [PATCHv11.1 5/9] efi: Add unaccepted memory support
  2023-05-13 22:04 ` [PATCHv11 5/9] efi: Provide helpers for " Kirill A. Shutemov
@ 2023-05-16 12:06   ` Kirill A. Shutemov
  2023-05-16 17:25     ` Ard Biesheuvel
  2023-05-17 15:58     ` Tom Lendacky
  0 siblings, 2 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-16 12:06 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: aarcange, ak, akpm, ardb, bp, dave.hansen, david, dfaggioli,
	jroedel, khalid.elmously, linux-coco, linux-efi, linux-kernel,
	linux-mm, luto, marcelo.cerri, mgorman, mingo, pbonzini, peterx,
	peterz, philip.cox, rientjes, rppt, sathyanarayanan.kuppuswamy,
	seanjc, tglx, thomas.lendacky, tim.gardner, vbabka, x86

efi_config_parse_tables() reserves the memory that holds the unaccepted
memory configuration table so it won't be reused by the page allocator.

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed.

 - range_contains_unaccepted_memory() checks if anything within the
   range requires acceptance.

Architectural code has to provide efi_get_unaccepted_table(), which
returns a pointer to the unaccepted memory configuration table.

arch_accept_memory() handles the arch-specific part of memory acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

v11.1:
 - Add missing memblock_reserve() for the unaccepted memory
   configuration table.

---
 drivers/firmware/efi/Makefile            |   1 +
 drivers/firmware/efi/efi.c               |  25 ++++++
 drivers/firmware/efi/unaccepted_memory.c | 103 +++++++++++++++++++++++
 include/linux/efi.h                      |   1 +
 4 files changed, 130 insertions(+)
 create mode 100644 drivers/firmware/efi/unaccepted_memory.c
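
For reference, the range-to-bit arithmetic that accept_memory() and
range_contains_unaccepted_memory() share can be sketched in standalone
userspace C. This is an illustration, not part of the patch: struct
unaccepted_layout, bitmap_bytes() and range_to_bits() are hypothetical
names, and the final "start >= end" guard is an addition of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_BYTE 8

/* Hypothetical, trimmed-down stand-in for struct efi_unaccepted_memory */
struct unaccepted_layout {
	uint64_t unit_size;	/* bytes covered by one bit, e.g. 2MiB */
	uint64_t phys_base;	/* physical address represented by bit 0 */
	uint64_t size;		/* bitmap size in bytes */
};

/* Bitmap bytes needed to cover [start, end), rounded up */
static uint64_t bitmap_bytes(uint64_t start, uint64_t end, uint64_t unit_size)
{
	uint64_t per_byte = unit_size * BITS_PER_BYTE;

	assert(unit_size != 0 && end >= start);
	return (end - start + per_byte - 1) / per_byte;
}

/*
 * Clamp the physical range [start, end) to the part the bitmap
 * represents and convert it to the bit range [*first, *last).
 * Returns false if no part of the range is represented.
 */
static bool range_to_bits(const struct unaccepted_layout *t,
			  uint64_t start, uint64_t end,
			  uint64_t *first, uint64_t *last)
{
	uint64_t covered = t->size * t->unit_size * BITS_PER_BYTE;

	assert(t->unit_size != 0);

	if (end < t->phys_base)
		return false;
	if (start < t->phys_base)
		start = t->phys_base;

	/* Translate to offsets from the beginning of the bitmap */
	start -= t->phys_base;
	end -= t->phys_base;

	/* Make sure not to overrun the bitmap */
	if (end > covered)
		end = covered;

	/* Guard added by this sketch: nothing left after clamping */
	if (start >= end)
		return false;

	*first = start / t->unit_size;
	*last = (end + t->unit_size - 1) / t->unit_size;
	return true;
}
```

With a 2MiB unit, bitmap_bytes() gives 4096 bytes for 64GiB and 256MiB
for 4PiB, matching the figures in the libstub comment.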

diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index b51f2a4c821e..e489fefd23da 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)	+= capsule-loader.o
 obj-$(CONFIG_EFI_EARLYCON)		+= earlycon.o
 obj-$(CONFIG_UEFI_CPER_ARM)		+= cper-arm.o
 obj-$(CONFIG_UEFI_CPER_X86)		+= cper-x86.o
+obj-$(CONFIG_UNACCEPTED_MEMORY)		+= unaccepted_memory.o
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 7dce06e419c5..d817e7afd266 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
 #ifdef CONFIG_EFI_COCO_SECRET
 	.coco_secret		= EFI_INVALID_TABLE_ADDR,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	.unaccepted		= EFI_INVALID_TABLE_ADDR,
+#endif
 };
 EXPORT_SYMBOL(efi);
 
@@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
 #ifdef CONFIG_EFI_COCO_SECRET
 	{LINUX_EFI_COCO_SECRET_AREA_GUID,	&efi.coco_secret,	"CocoSecret"	},
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	{LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID,	&efi.unaccepted,	"Unaccepted"	},
+#endif
 #ifdef CONFIG_EFI_GENERIC_STUB
 	{LINUX_EFI_SCREEN_INFO_TABLE_GUID,	&screen_info_table			},
 #endif
@@ -759,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
 		}
 	}
 
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+	    efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
+		struct efi_unaccepted_memory *unaccepted;
+
+		unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
+		if (unaccepted) {
+			unsigned long size;
+
+			if (unaccepted->version == 1) {
+				size = sizeof(*unaccepted) + unaccepted->size;
+				memblock_reserve(efi.unaccepted, size);
+			} else {
+				efi.unaccepted = EFI_INVALID_TABLE_ADDR;
+			}
+
+			early_memunmap(unaccepted, sizeof(*unaccepted));
+		}
+	}
+
 	return 0;
 }
 
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
new file mode 100644
index 000000000000..bb91c41f76fb
--- /dev/null
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <linux/memblock.h>
+#include <linux/spinlock.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long range_start, range_end;
+	unsigned long flags;
+	u64 unit_size;
+
+	if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
+		return;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return;
+
+	unit_size = unaccepted->unit_size;
+
+	/*
+	 * Only care for the part of the range that is represented
+	 * in the bitmap.
+	 */
+	if (start < unaccepted->phys_base)
+		start = unaccepted->phys_base;
+	if (end < unaccepted->phys_base)
+		return;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted->phys_base;
+	end -= unaccepted->phys_base;
+
+	/* Make sure not to overrun the bitmap */
+	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+		end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	range_start = start / unit_size;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
+				   DIV_ROUND_UP(end, unit_size)) {
+		unsigned long phys_start, phys_end;
+		unsigned long len = range_end - range_start;
+
+		phys_start = range_start * unit_size + unaccepted->phys_base;
+		phys_end = range_end * unit_size + unaccepted->phys_base;
+
+		arch_accept_memory(phys_start, phys_end);
+		bitmap_clear(unaccepted->bitmap, range_start, len);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long flags;
+	bool ret = false;
+	u64 unit_size;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return false;
+
+	unit_size = unaccepted->unit_size;
+
+	/*
+	 * Only care for the part of the range that is represented
+	 * in the bitmap.
+	 */
+	if (start < unaccepted->phys_base)
+		start = unaccepted->phys_base;
+	if (end < unaccepted->phys_base)
+		return false;
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted->phys_base;
+	end -= unaccepted->phys_base;
+
+	/* Make sure not to overrun the bitmap */
+	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+		end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	while (start < end) {
+		if (test_bit(start / unit_size, unaccepted->bitmap)) {
+			ret = true;
+			break;
+		}
+
+		start += unit_size;
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return ret;
+}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 29cc622910da..9864f9c00da2 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -646,6 +646,7 @@ extern struct efi {
 	unsigned long			tpm_final_log;		/* TPM2 Final Events Log table */
 	unsigned long			mokvar_table;		/* MOK variable config table */
 	unsigned long			coco_secret;		/* Confidential computing secret table */
+	unsigned long			unaccepted;		/* Unaccepted memory table */
 
 	efi_get_time_t			*get_time;
 	efi_set_time_t			*set_time;
-- 
2.39.3


* Re: [PATCHv11 4/9] x86/boot/compressed: Handle unaccepted memory
  2023-05-13 22:04 ` [PATCHv11 4/9] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2023-05-16 17:09   ` Liam Merwick
  2023-05-17 15:52   ` Tom Lendacky
  1 sibling, 0 replies; 38+ messages in thread
From: Liam Merwick @ 2023-05-16 17:09 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 13/05/2023 23:04, Kirill A. Shutemov wrote:
> The firmware will pre-accept the memory used to run the stub. But, the
> stub is responsible for accepting the memory into which it decompresses
> the main kernel. Accept memory just before decompression starts.
> 
> The stub is also responsible for choosing a physical address in which to
> place the decompressed kernel image. The KASLR mechanism will randomize
> this physical address. Since the unaccepted memory region is relatively


Reading this sentence, should "unaccepted" be "accepted" here?
(i.e. most memory at the start is unaccepted and the accepted region is 
the smaller one).

> small, KASLR would be quite ineffective if it only used the pre-accepted
> area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes among the
> entire physical address space by also including EFI_UNACCEPTED_MEMORY.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Otherwise

Reviewed-by: Liam Merwick <liam.merwick@oracle.com>


> ---
>   arch/x86/boot/compressed/efi.h   |  1 +
>   arch/x86/boot/compressed/kaslr.c | 35 +++++++++++++++++++++-----------
>   arch/x86/boot/compressed/misc.c  |  6 ++++++
>   arch/x86/boot/compressed/misc.h  |  6 ++++++
>   4 files changed, 36 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
> index 7db2f41b54cd..cf475243b6d5 100644
> --- a/arch/x86/boot/compressed/efi.h
> +++ b/arch/x86/boot/compressed/efi.h
> @@ -32,6 +32,7 @@ typedef	struct {
>   } efi_table_hdr_t;
>   
>   #define EFI_CONVENTIONAL_MEMORY		 7
> +#define EFI_UNACCEPTED_MEMORY		15
>   
>   #define EFI_MEMORY_MORE_RELIABLE \
>   				((u64)0x0000000000010000ULL)	/* higher reliability */
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 454757fbdfe5..749f0fe7e446 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -672,6 +672,28 @@ static bool process_mem_region(struct mem_vector *region,
>   }
>   
>   #ifdef CONFIG_EFI
> +
> +/*
> + * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
> + * guaranteed to be free.
> + *
> + * It is more conservative in picking free memory than the EFI spec allows:
> + *
> + * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
> + * and thus available to place the kernel image into, but in practice there's
> + * firmware where using that memory leads to crashes.
> + */
> +static inline bool memory_type_is_free(efi_memory_desc_t *md)
> +{
> +	if (md->type == EFI_CONVENTIONAL_MEMORY)
> +		return true;
> +
> +	if (md->type == EFI_UNACCEPTED_MEMORY)
> +		return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
> +
> +	return false;
> +}
> +
>   /*
>    * Returns true if we processed the EFI memmap, which we prefer over the E820
>    * table if it is available.
> @@ -716,18 +738,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
>   	for (i = 0; i < nr_desc; i++) {
>   		md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
>   
> -		/*
> -		 * Here we are more conservative in picking free memory than
> -		 * the EFI spec allows:
> -		 *
> -		 * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
> -		 * free memory and thus available to place the kernel image into,
> -		 * but in practice there's firmware where using that memory leads
> -		 * to crashes.
> -		 *
> -		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
> -		 */
> -		if (md->type != EFI_CONVENTIONAL_MEMORY)
> +		if (!memory_type_is_free(md))
>   			continue;
>   
>   		if (efi_soft_reserve_enabled() &&
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index 014ff222bf4b..eb8df0d4ad51 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -455,6 +455,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
>   #endif
>   
>   	debug_putstr("\nDecompressing Linux... ");
> +
> +	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
> +		debug_putstr("Accepting memory... ");
> +		accept_memory(__pa(output), __pa(output) + needed_size);
> +	}
> +
>   	__decompress(input_data, input_len, NULL, NULL, output, output_len,
>   			NULL, error);
>   	entry_offset = parse_elf(output);
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 2f155a0e3041..9663d1839f54 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -247,4 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
>   }
>   #endif /* CONFIG_EFI */
>   
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +#else
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
> +#endif
> +
>   #endif /* BOOT_COMPRESSED_MISC_H */



* Re: [PATCHv11.1 5/9] efi: Add unaccepted memory support
  2023-05-16 12:06   ` [PATCHv11.1 5/9] efi: Add unaccepted memory support Kirill A. Shutemov
@ 2023-05-16 17:25     ` Ard Biesheuvel
  2023-05-17 15:58     ` Tom Lendacky
  1 sibling, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 17:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: aarcange, ak, akpm, bp, dave.hansen, david, dfaggioli, jroedel,
	khalid.elmously, linux-coco, linux-efi, linux-kernel, linux-mm,
	luto, marcelo.cerri, mgorman, mingo, pbonzini, peterx, peterz,
	philip.cox, rientjes, rppt, sathyanarayanan.kuppuswamy, seanjc,
	tglx, thomas.lendacky, tim.gardner, vbabka, x86

On Tue, 16 May 2023 at 14:07, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> efi_config_parse_tables() reserves the memory that holds the unaccepted
> memory configuration table so it won't be reused by the page allocator.
>
> Core-mm requires a few helpers to support unaccepted memory:
>
>  - accept_memory() checks the range of addresses against the bitmap and
>    accepts memory if needed.
>
>  - range_contains_unaccepted_memory() checks if anything within the
>    range requires acceptance.
>
> Architectural code has to provide efi_get_unaccepted_table() that
> returns a pointer to the unaccepted memory configuration table.
>
> arch_accept_memory() handles the arch-specific part of memory acceptance.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
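
For readers following the range handling in the two helpers described
above, here is a minimal user-space sketch of the same idea: clamp the
physical range to what the bitmap covers, convert it to bit indices, then
either test or clear the bits. It is a model only, not the kernel code.
The names struct unaccepted_model, model_clamp(), model_accept() and
model_range_contains_unaccepted() are hypothetical, the bitmap is a plain
byte array, and there is no locking and no real acceptance call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the unaccepted memory table. */
struct unaccepted_model {
	uint64_t phys_base;  /* physical address represented by bit 0 */
	uint64_t unit_size;  /* bytes represented by one bit */
	uint64_t nbits;      /* number of valid bits in the bitmap */
	uint8_t *bitmap;     /* bit set => unit has not been accepted yet */
};

/*
 * Clamp [start, end) to the part covered by the bitmap and translate it
 * to bit indices [*first, *last). Returns false if nothing is covered.
 */
bool model_clamp(const struct unaccepted_model *t, uint64_t start,
		 uint64_t end, uint64_t *first, uint64_t *last)
{
	uint64_t limit = t->nbits * t->unit_size;

	assert(t->unit_size != 0);

	if (end <= t->phys_base)
		return false;
	if (start < t->phys_base)
		start = t->phys_base;

	/* Translate to offsets from the beginning of the bitmap */
	start -= t->phys_base;
	end -= t->phys_base;

	/* Make sure not to overrun the bitmap */
	if (end > limit)
		end = limit;
	if (start >= end)
		return false;

	*first = start / t->unit_size;
	*last = (end + t->unit_size - 1) / t->unit_size;
	return true;
}

/* Returns true if any unit overlapping [start, end) is still unaccepted. */
bool model_range_contains_unaccepted(const struct unaccepted_model *t,
				     uint64_t start, uint64_t end)
{
	uint64_t bit, first, last;

	if (!model_clamp(t, start, end, &first, &last))
		return false;

	for (bit = first; bit < last; bit++)
		if (t->bitmap[bit / 8] & (1u << (bit % 8)))
			return true;
	return false;
}

/*
 * Clears every set bit overlapping [start, end) and returns how many
 * units were "accepted". A real implementation would call the
 * platform-specific acceptance routine before clearing each range.
 */
uint64_t model_accept(struct unaccepted_model *t, uint64_t start,
		      uint64_t end)
{
	uint64_t bit, first, last, accepted = 0;

	if (!model_clamp(t, start, end, &first, &last))
		return 0;

	for (bit = first; bit < last; bit++) {
		if (t->bitmap[bit / 8] & (1u << (bit % 8))) {
			t->bitmap[bit / 8] &= (uint8_t)~(1u << (bit % 8));
			accepted++;
		}
	}
	return accepted;
}
```

Because a partially covered unit is rounded up to a whole bit, a query for
a single byte inside an unaccepted unit reports true, and accepting that
byte accepts the whole unit.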

> v11.1:
>  - Add missing memblock_reserve() for the unaccepted memory
>    configuration table.
>
> ---
>  drivers/firmware/efi/Makefile            |   1 +
>  drivers/firmware/efi/efi.c               |  25 ++++++
>  drivers/firmware/efi/unaccepted_memory.c | 103 +++++++++++++++++++++++
>  include/linux/efi.h                      |   1 +
>  4 files changed, 130 insertions(+)
>  create mode 100644 drivers/firmware/efi/unaccepted_memory.c
>
> diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
> index b51f2a4c821e..e489fefd23da 100644
> --- a/drivers/firmware/efi/Makefile
> +++ b/drivers/firmware/efi/Makefile
> @@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)      += capsule-loader.o
>  obj-$(CONFIG_EFI_EARLYCON)             += earlycon.o
>  obj-$(CONFIG_UEFI_CPER_ARM)            += cper-arm.o
>  obj-$(CONFIG_UEFI_CPER_X86)            += cper-x86.o
> +obj-$(CONFIG_UNACCEPTED_MEMORY)                += unaccepted_memory.o
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 7dce06e419c5..d817e7afd266 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
>  #ifdef CONFIG_EFI_COCO_SECRET
>         .coco_secret            = EFI_INVALID_TABLE_ADDR,
>  #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +       .unaccepted             = EFI_INVALID_TABLE_ADDR,
> +#endif
>  };
>  EXPORT_SYMBOL(efi);
>
> @@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
>  #ifdef CONFIG_EFI_COCO_SECRET
>         {LINUX_EFI_COCO_SECRET_AREA_GUID,       &efi.coco_secret,       "CocoSecret"    },
>  #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +       {LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID,   &efi.unaccepted,        "Unaccepted"    },
> +#endif
>  #ifdef CONFIG_EFI_GENERIC_STUB
>         {LINUX_EFI_SCREEN_INFO_TABLE_GUID,      &screen_info_table                      },
>  #endif
> @@ -759,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
>                 }
>         }
>
> +       if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
> +           efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> +               struct efi_unaccepted_memory *unaccepted;
> +
> +               unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
> +               if (unaccepted) {
> +                       unsigned long size;
> +
> +                       if (unaccepted->version == 1) {
> +                               size = sizeof(*unaccepted) + unaccepted->size;
> +                               memblock_reserve(efi.unaccepted, size);
> +                       } else {
> +                               efi.unaccepted = EFI_INVALID_TABLE_ADDR;
> +                       }
> +
> +                       early_memunmap(unaccepted, sizeof(*unaccepted));
> +               }
> +       }
> +
>         return 0;
>  }
>
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> new file mode 100644
> index 000000000000..bb91c41f76fb
> --- /dev/null
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <linux/memblock.h>
> +#include <linux/spinlock.h>
> +#include <asm/unaccepted_memory.h>
> +
> +/* Protects unaccepted memory bitmap */
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +       struct efi_unaccepted_memory *unaccepted;
> +       unsigned long range_start, range_end;
> +       unsigned long flags;
> +       u64 unit_size;
> +
> +       if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
> +               return;
> +
> +       unaccepted = efi_get_unaccepted_table();
> +       if (!unaccepted)
> +               return;
> +
> +       unit_size = unaccepted->unit_size;
> +
> +       /*
> +        * Only care for the part of the range that is represented
> +        * in the bitmap.
> +        */
> +       if (start < unaccepted->phys_base)
> +               start = unaccepted->phys_base;
> +       if (end < unaccepted->phys_base)
> +               return;
> +
> +       /* Translate to offsets from the beginning of the bitmap */
> +       start -= unaccepted->phys_base;
> +       end -= unaccepted->phys_base;
> +
> +       /* Make sure not to overrun the bitmap */
> +       if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> +               end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> +       range_start = start / unit_size;
> +
> +       spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +       for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
> +                                  DIV_ROUND_UP(end, unit_size)) {
> +               unsigned long phys_start, phys_end;
> +               unsigned long len = range_end - range_start;
> +
> +               phys_start = range_start * unit_size + unaccepted->phys_base;
> +               phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> +               arch_accept_memory(phys_start, phys_end);
> +               bitmap_clear(unaccepted->bitmap, range_start, len);
> +       }
> +       spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}
> +
> +bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
> +{
> +       struct efi_unaccepted_memory *unaccepted;
> +       unsigned long flags;
> +       bool ret = false;
> +       u64 unit_size;
> +
> +       unaccepted = efi_get_unaccepted_table();
> +       if (!unaccepted)
> +               return false;
> +
> +       unit_size = unaccepted->unit_size;
> +
> +       /*
> +        * Only care for the part of the range that is represented
> +        * in the bitmap.
> +        */
> +       if (start < unaccepted->phys_base)
> +               start = unaccepted->phys_base;
> +       if (end < unaccepted->phys_base)
> +               return false;
> +
> +       /* Translate to offsets from the beginning of the bitmap */
> +       start -= unaccepted->phys_base;
> +       end -= unaccepted->phys_base;
> +
> +       /* Make sure not to overrun the bitmap */
> +       if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> +               end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> +       spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +       while (start < end) {
> +               if (test_bit(start / unit_size, unaccepted->bitmap)) {
> +                       ret = true;
> +                       break;
> +               }
> +
> +               start += unit_size;
> +       }
> +       spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> +       return ret;
> +}
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 29cc622910da..9864f9c00da2 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -646,6 +646,7 @@ extern struct efi {
>         unsigned long                   tpm_final_log;          /* TPM2 Final Events Log table */
>         unsigned long                   mokvar_table;           /* MOK variable config table */
>         unsigned long                   coco_secret;            /* Confidential computing secret table */
> +       unsigned long                   unaccepted;             /* Unaccepted memory table */
>
>         efi_get_time_t                  *get_time;
>         efi_set_time_t                  *set_time;
> --
> 2.39.3
>


* Re: [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory
  2023-05-14 21:13     ` Kirill A. Shutemov
@ 2023-05-16 18:01       ` Ard Biesheuvel
  0 siblings, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 18:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mika Penttilä,
	Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Sun, 14 May 2023 at 23:13, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Sun, May 14, 2023 at 08:08:07AM +0300, Mika Penttilä wrote:
> > > +   status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
> > > +                        sizeof(*unaccepted_table) + bitmap_size,
> > > +                        (void **)&unaccepted_table);
> >
> >
> > Wonder if EFI_LOADER_DATA guarantees bitmap not to be freed, or should some
> > more persistent type be used. If EFI_LOADER_DATA is enough, maybe a comment
> > why it is safe could be added.
>
> Ughh.. I've lost the hunk that reserves the memory explicitly while
> folding in the patch we discussed with Ard. See below.
>
> But the question is solid.
>
> Ard, do we want to allocate the memory as EFI_RUNTIME_SERVICES_DATA (or
> something else?) that got reserved automatically without additional steps?
>


EFI loader data should be fine here, as long as we reserve it.

EFI runtime services data is intended for allocations that have
significance to the firmware itself, so it gets mapped into the EFI
runtime page tables and on some architectures, it gets removed from
the direct map as well.

The unaccepted bitmap is only accessed by the OS itself, so runtime
services data is really not the right choice. We just have to ensure
the bitmap gets reserved in memblock sufficiently early.

> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index e15a2005ed93..d817e7afd266 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -765,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
>                 }
>         }
>
> +       if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
> +           efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> +               struct efi_unaccepted_memory *unaccepted;
> +
> +               unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
> +               if (unaccepted) {
> +                       unsigned long size;
> +
> +                       if (unaccepted->version == 1) {
> +                               size = sizeof(*unaccepted) + unaccepted->size;
> +                               memblock_reserve(efi.unaccepted, size);
> +                       } else {
> +                               efi.unaccepted = EFI_INVALID_TABLE_ADDR;
> +                       }
> +
> +                       early_memunmap(unaccepted, sizeof(*unaccepted));
> +               }
> +       }
> +
>         return 0;
>  }
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory
  2023-05-13 22:04 ` [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory Kirill A. Shutemov
  2023-05-14  5:08   ` Mika Penttilä
@ 2023-05-16 18:06   ` Ard Biesheuvel
  1 sibling, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 18:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On Sun, 14 May 2023 at 00:04, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual
> Machine platform.
>
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
>
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 (or whatever the arch uses) has to be modified on every
> page acceptance. It leads to table fragmentation and there's a limited
> number of entries in the e820 table.
>
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents a
> naturally aligned power-of-2-sized region of address space -- a unit.
>
> For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB of
> physical address space.
>
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- it needs 256MiB to handle 4PiB of the address
> space.
>
> Any unaccepted memory that is not aligned to unit_size gets accepted
> upfront.
>
> The bitmap is allocated and constructed in the EFI stub and passed down
> to the kernel via EFI configuration table. allocate_e820() allocates the
> bitmap if unaccepted memory is present, according to the size of
> unaccepted region.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
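
The sizing figures and the "unaligned memory gets accepted upfront" rule
in the commit message can be checked with a small user-space sketch. It
assumes a power-of-2 unit size and is not the stub code: bitmap_bytes(),
split_range() and struct split are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8ULL

/* Bitmap bytes needed to cover [start, end) at one bit per unit_size bytes. */
uint64_t bitmap_bytes(uint64_t start, uint64_t end, uint64_t unit_size)
{
	uint64_t per_byte = unit_size * BITS_PER_BYTE;

	assert(unit_size != 0 && end >= start);
	return (end - start + per_byte - 1) / per_byte;
}

/* Hypothetical result: what is accepted upfront vs. left for the bitmap. */
struct split {
	uint64_t head_end;      /* [start, head_end) is accepted upfront */
	uint64_t tail_start;    /* [tail_start, end) is accepted upfront */
	uint64_t tracked_units; /* whole units in [head_end, tail_start) */
};

/*
 * Split [start, end) into an unaligned head, a unit-aligned middle and an
 * unaligned tail. Ranges shorter than two units are accepted entirely, so
 * the middle is never empty when something is tracked.
 */
struct split split_range(uint64_t start, uint64_t end, uint64_t unit_size)
{
	uint64_t mask = unit_size - 1;
	struct split s;

	/* unit_size must be a non-zero power of two */
	assert(unit_size != 0 && (unit_size & mask) == 0 && end >= start);

	if (end - start < 2 * unit_size) {
		s.head_end = end;
		s.tail_start = end;
		s.tracked_units = 0;
		return s;
	}

	s.head_end = (start + mask) & ~mask;  /* round start up */
	s.tail_start = end & ~mask;           /* round end down */
	s.tracked_units = (s.tail_start - s.head_end) / unit_size;
	return s;
}
```

With a 2MiB unit, bitmap_bytes() gives 4096 bytes for 64GiB and 256MiB
for 4PiB, matching the numbers quoted above. A range from 1MiB to 7MiB
splits into a 1MiB head, a 1MiB tail and two tracked units.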

> ---
>  arch/x86/boot/compressed/Makefile             |   1 +
>  arch/x86/boot/compressed/mem.c                |   9 +
>  arch/x86/include/asm/efi.h                    |   2 +
>  drivers/firmware/efi/Kconfig                  |  14 ++
>  drivers/firmware/efi/efi.c                    |   1 +
>  drivers/firmware/efi/libstub/Makefile         |   2 +
>  drivers/firmware/efi/libstub/bitmap.c         |  41 ++++
>  drivers/firmware/efi/libstub/efistub.h        |   6 +
>  drivers/firmware/efi/libstub/find.c           |  43 ++++
>  .../firmware/efi/libstub/unaccepted_memory.c  | 222 ++++++++++++++++++
>  drivers/firmware/efi/libstub/x86-stub.c       |  13 +
>  include/linux/efi.h                           |  12 +-
>  12 files changed, 365 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/boot/compressed/mem.c
>  create mode 100644 drivers/firmware/efi/libstub/bitmap.c
>  create mode 100644 drivers/firmware/efi/libstub/find.c
>  create mode 100644 drivers/firmware/efi/libstub/unaccepted_memory.c
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 6b6cfe607bdb..cc4978123c30 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -107,6 +107,7 @@ endif
>
>  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>  vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
>
>  vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
> new file mode 100644
> index 000000000000..67594fcb11d9
> --- /dev/null
> +++ b/arch/x86/boot/compressed/mem.c
> @@ -0,0 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "error.h"
> +
> +void arch_accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +       /* Platform-specific memory-acceptance call goes here */
> +       error("Cannot accept memory");
> +}
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index 419280d263d2..8b4be7cecdb8 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -31,6 +31,8 @@ extern unsigned long efi_mixed_mode_stack_pa;
>
>  #define ARCH_EFI_IRQ_FLAGS_MASK        X86_EFLAGS_IF
>
> +#define EFI_UNACCEPTED_UNIT_SIZE PMD_SIZE
> +
>  /*
>   * The EFI services are called through variadic functions in many cases. These
>   * functions are implemented in assembler and support only a fixed number of
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 043ca31c114e..231f1c70d1db 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -269,6 +269,20 @@ config EFI_COCO_SECRET
>           virt/coco/efi_secret module to access the secrets, which in turn
>           allows userspace programs to access the injected secrets.
>
> +config UNACCEPTED_MEMORY
> +       bool
> +       depends on EFI_STUB
> +       help
> +          Some Virtual Machine platforms, such as Intel TDX, require
> +          some memory to be "accepted" by the guest before it can be used.
> +          This mechanism helps prevent malicious hosts from making changes
> +          to guest memory.
> +
> +          UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> +          This option adds support for unaccepted memory and makes such memory
> +          usable by the kernel.
> +
>  config EFI_EMBEDDED_FIRMWARE
>         bool
>         select CRYPTO_LIB_SHA256
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index abeff7dc0b58..7dce06e419c5 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -843,6 +843,7 @@ static __initdata char memory_type_name[][13] = {
>         "MMIO Port",
>         "PAL Code",
>         "Persistent",
> +       "Unaccepted",
>  };
>
>  char * __init efi_md_typeattr_format(char *buf, size_t size,
> diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
> index 3abb2b357482..16d64a34d1e1 100644
> --- a/drivers/firmware/efi/libstub/Makefile
> +++ b/drivers/firmware/efi/libstub/Makefile
> @@ -96,6 +96,8 @@ CFLAGS_arm32-stub.o           := -DTEXT_OFFSET=$(TEXT_OFFSET)
>  zboot-obj-$(CONFIG_RISCV)      := lib-clz_ctz.o lib-ashldi3.o
>  lib-$(CONFIG_EFI_ZBOOT)                += zboot.o $(zboot-obj-y)
>
> +lib-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o bitmap.o find.o
> +
>  extra-y                                := $(lib-y)
>  lib-y                          := $(patsubst %.o,%.stub.o,$(lib-y))
>
> diff --git a/drivers/firmware/efi/libstub/bitmap.c b/drivers/firmware/efi/libstub/bitmap.c
> new file mode 100644
> index 000000000000..5c9bba0d549b
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/bitmap.c
> @@ -0,0 +1,41 @@
> +#include <linux/bitmap.h>
> +
> +void __bitmap_set(unsigned long *map, unsigned int start, int len)
> +{
> +       unsigned long *p = map + BIT_WORD(start);
> +       const unsigned int size = start + len;
> +       int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
> +       unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
> +
> +       while (len - bits_to_set >= 0) {
> +               *p |= mask_to_set;
> +               len -= bits_to_set;
> +               bits_to_set = BITS_PER_LONG;
> +               mask_to_set = ~0UL;
> +               p++;
> +       }
> +       if (len) {
> +               mask_to_set &= BITMAP_LAST_WORD_MASK(size);
> +               *p |= mask_to_set;
> +       }
> +}
> +
> +void __bitmap_clear(unsigned long *map, unsigned int start, int len)
> +{
> +       unsigned long *p = map + BIT_WORD(start);
> +       const unsigned int size = start + len;
> +       int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
> +       unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
> +
> +       while (len - bits_to_clear >= 0) {
> +               *p &= ~mask_to_clear;
> +               len -= bits_to_clear;
> +               bits_to_clear = BITS_PER_LONG;
> +               mask_to_clear = ~0UL;
> +               p++;
> +       }
> +       if (len) {
> +               mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
> +               *p &= ~mask_to_clear;
> +       }
> +}
> diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
> index 67d5a20802e0..8659a01664b8 100644
> --- a/drivers/firmware/efi/libstub/efistub.h
> +++ b/drivers/firmware/efi/libstub/efistub.h
> @@ -1133,4 +1133,10 @@ const u8 *__efi_get_smbios_string(const struct efi_smbios_record *record,
>  void efi_remap_image(unsigned long image_base, unsigned alloc_size,
>                      unsigned long code_size);
>
> +efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> +                                       struct efi_boot_memmap *map);
> +void process_unaccepted_memory(u64 start, u64 end);
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +void arch_accept_memory(phys_addr_t start, phys_addr_t end);
> +
>  #endif
> diff --git a/drivers/firmware/efi/libstub/find.c b/drivers/firmware/efi/libstub/find.c
> new file mode 100644
> index 000000000000..4e7740d28987
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/find.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/bitmap.h>
> +#include <linux/math.h>
> +#include <linux/minmax.h>
> +
> +/*
> + * Common helper for find_next_bit() function family
> + * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
> + * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
> + * @size: The bitmap size in bits
> + * @start: The bitnumber to start searching at
> + */
> +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)                               \
> +({                                                                             \
> +       unsigned long mask, idx, tmp, sz = (size), __start = (start);           \
> +                                                                               \
> +       if (unlikely(__start >= sz))                                            \
> +               goto out;                                                       \
> +                                                                               \
> +       mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));                          \
> +       idx = __start / BITS_PER_LONG;                                          \
> +                                                                               \
> +       for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {                       \
> +               if ((idx + 1) * BITS_PER_LONG >= sz)                            \
> +                       goto out;                                               \
> +               idx++;                                                          \
> +       }                                                                       \
> +                                                                               \
> +       sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);                  \
> +out:                                                                           \
> +       sz;                                                                     \
> +})
> +
> +unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
> +{
> +       return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
> +}
> +
> +unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
> +                                        unsigned long start)
> +{
> +       return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
> +}
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> new file mode 100644
> index 000000000000..f4642c4f25dd
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -0,0 +1,222 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <asm/efi.h>
> +#include "efistub.h"
> +
> +static struct efi_unaccepted_memory *unaccepted_table;
> +
> +efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> +                                       struct efi_boot_memmap *map)
> +{
> +       efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
> +       u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
> +       efi_status_t status;
> +       int i;
> +
> +       /* Check if the table is already installed */
> +       unaccepted_table = get_efi_config_table(unaccepted_table_guid);
> +       if (unaccepted_table) {
> +               if (unaccepted_table->version != 1) {
> +                       efi_err("Unknown version of unaccepted memory table\n");
> +                       return EFI_UNSUPPORTED;
> +               }
> +               return EFI_SUCCESS;
> +       }
> +
> +       /* Check if there's any unaccepted memory and find the max address */
> +       for (i = 0; i < nr_desc; i++) {
> +               efi_memory_desc_t *d;
> +               unsigned long m = (unsigned long)map->map;
> +
> +               d = efi_early_memdesc_ptr(m, map->desc_size, i);
> +               if (d->type != EFI_UNACCEPTED_MEMORY)
> +                       continue;
> +
> +               unaccepted_start = min(unaccepted_start, d->phys_addr);
> +               unaccepted_end = max(unaccepted_end,
> +                                    d->phys_addr + d->num_pages * PAGE_SIZE);
> +       }
> +
> +       if (unaccepted_start == ULLONG_MAX)
> +               return EFI_SUCCESS;
> +
> +       unaccepted_start = round_down(unaccepted_start,
> +                                     EFI_UNACCEPTED_UNIT_SIZE);
> +       unaccepted_end = round_up(unaccepted_end, EFI_UNACCEPTED_UNIT_SIZE);
> +
> +       /*
> +        * If unaccepted memory is present, allocate a bitmap to track what
> +        * memory has to be accepted before access.
> +        *
> +        * One bit in the bitmap represents 2MiB in the address space:
> +        * A 4k bitmap can track 64GiB of physical address space.
> +        *
> +        * In the worst-case scenario -- a huge hole in the middle of the
> +        * address space -- it needs 256MiB to handle 4PiB of the address
> +        * space.
> +        *
> +        * The bitmap will be populated in setup_e820() according to the memory
> +        * map after efi_exit_boot_services().
> +        */
> +       bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
> +                                  EFI_UNACCEPTED_UNIT_SIZE * BITS_PER_BYTE);
> +
> +       status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
> +                            sizeof(*unaccepted_table) + bitmap_size,
> +                            (void **)&unaccepted_table);
> +       if (status != EFI_SUCCESS) {
> +               efi_err("Failed to allocate unaccepted memory config table\n");
> +               return status;
> +       }
> +
> +       unaccepted_table->version = 1;
> +       unaccepted_table->unit_size = EFI_UNACCEPTED_UNIT_SIZE;
> +       unaccepted_table->phys_base = unaccepted_start;
> +       unaccepted_table->size = bitmap_size;
> +       memset(unaccepted_table->bitmap, 0, bitmap_size);
> +
> +       status = efi_bs_call(install_configuration_table,
> +                            &unaccepted_table_guid, unaccepted_table);
> +       if (status != EFI_SUCCESS) {
> +               efi_bs_call(free_pool, unaccepted_table);
> +               efi_err("Failed to install unaccepted memory config table!\n");
> +       }
> +
> +       return status;
> +}
> +
> +/*
> + * The accepted memory bitmap only works at unit_size granularity.  Take
> + * unaligned start/end addresses and either:
> + *  1. Accepts the memory immediately and in its entirety
> + *  2. Accepts unaligned parts, and marks *some* aligned part unaccepted
> + *
> + * The function will never reach the bitmap_set() with zero bits to set.
> + */
> +void process_unaccepted_memory(u64 start, u64 end)
> +{
> +       u64 unit_size = unaccepted_table->unit_size;
> +       u64 unit_mask = unaccepted_table->unit_size - 1;
> +       u64 bitmap_size = unaccepted_table->size;
> +
> +       /*
> +        * Ensure that at least one bit will be set in the bitmap by
> +        * immediately accepting all regions under 2*unit_size.  This is
> +        * imprecise and may immediately accept some areas that could
> +        * have been represented in the bitmap.  But, results in simpler
> +        * code below
> +        *
> +        * Consider case like this (assuming unit_size == 2MB):
> +        *
> +        * | 4k | 2044k |    2048k   |
> +        * ^ 0x0        ^ 2MB        ^ 4MB
> +        *
> +        * Only the first 4k has been accepted. The 0MB->2MB region can not be
> +        * represented in the bitmap. The 2MB->4MB region can be represented in
> +        * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
> +        * immediately accepted in its entirety.
> +        */
> +       if (end - start < 2 * unit_size) {
> +               arch_accept_memory(start, end);
> +               return;
> +       }
> +
> +       /*
> +        * No matter how the start and end are aligned, at least one unaccepted
> +        * unit_size area will remain to be marked in the bitmap.
> +        */
> +
> +       /* Immediately accept a <unit_size piece at the start: */
> +       if (start & unit_mask) {
> +               arch_accept_memory(start, round_up(start, unit_size));
> +               start = round_up(start, unit_size);
> +       }
> +
> +       /* Immediately accept a <unit_size piece at the end: */
> +       if (end & unit_mask) {
> +               arch_accept_memory(round_down(end, unit_size), end);
> +               end = round_down(end, unit_size);
> +       }
> +
> +       /*
> +        * Accept the part of the range that is before phys_base and cannot
> +        * be recorded in the bitmap.
> +        */
> +       if (start < unaccepted_table->phys_base) {
> +               arch_accept_memory(start,
> +                                  min(unaccepted_table->phys_base, end));
> +               start = unaccepted_table->phys_base;
> +       }
> +
> +       /* Nothing to record */
> +       if (end < unaccepted_table->phys_base)
> +               return;
> +
> +       /* Translate to offsets from the beginning of the bitmap */
> +       start -= unaccepted_table->phys_base;
> +       end -= unaccepted_table->phys_base;
> +
> +       /* Accept memory that doesn't fit into bitmap */
> +       if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
> +               unsigned long phys_start, phys_end;
> +
> +               phys_start = bitmap_size * unit_size * BITS_PER_BYTE +
> +                            unaccepted_table->phys_base;
> +               phys_end = end + unaccepted_table->phys_base;
> +
> +               arch_accept_memory(phys_start, phys_end);
> +               end = bitmap_size * unit_size * BITS_PER_BYTE;
> +       }
> +
> +       /*
> +        * 'start' and 'end' are now both unit_size-aligned.
> +        * Record the range as being unaccepted:
> +        */
> +       bitmap_set(unaccepted_table->bitmap,
> +                  start / unit_size, (end - start) / unit_size);
> +}
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +       unsigned long range_start, range_end;
> +       unsigned long bitmap_size;
> +       u64 unit_size;
> +
> +       if (!unaccepted_table)
> +               return;
> +
> +       unit_size = unaccepted_table->unit_size;
> +
> +       /*
> +        * Only care for the part of the range that is represented
> +        * in the bitmap.
> +        */
> +       if (start < unaccepted_table->phys_base)
> +               start = unaccepted_table->phys_base;
> +       if (end < unaccepted_table->phys_base)
> +               return;
> +
> +       /* Translate to offsets from the beginning of the bitmap */
> +       start -= unaccepted_table->phys_base;
> +       end -= unaccepted_table->phys_base;
> +
> +       /* Make sure not to overrun the bitmap */
> +       if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
> +               end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
> +
> +       range_start = start / unit_size;
> +       bitmap_size = DIV_ROUND_UP(end, unit_size);
> +
> +       for_each_set_bitrange_from(range_start, range_end,
> +                                  unaccepted_table->bitmap, bitmap_size) {
> +               unsigned long phys_start, phys_end;
> +
> +               phys_start = range_start * unit_size + unaccepted_table->phys_base;
> +               phys_end = range_end * unit_size + unaccepted_table->phys_base;
> +
> +               arch_accept_memory(phys_start, phys_end);
> +               bitmap_clear(unaccepted_table->bitmap,
> +                            range_start, range_end - range_start);
> +       }
> +}
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index fff81843169c..8d17cee8b98e 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -613,6 +613,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>                         e820_type = E820_TYPE_PMEM;
>                         break;
>
> +               case EFI_UNACCEPTED_MEMORY:
> +                       if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
> +                               efi_warn_once(
> +"The system has unaccepted memory, but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
> +                               continue;
> +                       }
> +                       e820_type = E820_TYPE_RAM;
> +                       process_unaccepted_memory(d->phys_addr,
> +                                                 d->phys_addr + PAGE_SIZE * d->num_pages);
> +                       break;
>                 default:
>                         continue;
>                 }
> @@ -697,6 +707,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
>                 status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
>         }
>
> +       if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
> +               status = allocate_unaccepted_bitmap(nr_desc, map);
> +
>         efi_bs_call(free_pool, map);
>         return status;
>  }
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 7aa62c92185f..29cc622910da 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -108,7 +108,8 @@ typedef     struct {
>  #define EFI_MEMORY_MAPPED_IO_PORT_SPACE        12
>  #define EFI_PAL_CODE                   13
>  #define EFI_PERSISTENT_MEMORY          14
> -#define EFI_MAX_MEMORY_TYPE            15
> +#define EFI_UNACCEPTED_MEMORY          15
> +#define EFI_MAX_MEMORY_TYPE            16
>
>  /* Attribute values: */
>  #define EFI_MEMORY_UC          ((u64)0x0000000000000001ULL)    /* uncached */
> @@ -417,6 +418,7 @@ void efi_native_runtime_setup(void);
>  #define LINUX_EFI_MOK_VARIABLE_TABLE_GUID      EFI_GUID(0xc451ed2b, 0x9694, 0x45d3,  0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
>  #define LINUX_EFI_COCO_SECRET_AREA_GUID                EFI_GUID(0xadf956ad, 0xe98c, 0x484c,  0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
>  #define LINUX_EFI_BOOT_MEMMAP_GUID             EFI_GUID(0x800f683f, 0xd08b, 0x423a,  0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
> +#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID    EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9,  0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
>
>  #define RISCV_EFI_BOOT_PROTOCOL_GUID           EFI_GUID(0xccd15fec, 0x6f73, 0x4eec,  0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)
>
> @@ -534,6 +536,14 @@ struct efi_boot_memmap {
>         efi_memory_desc_t       map[];
>  };
>
> +struct efi_unaccepted_memory {
> +       u32 version;
> +       u32 unit_size;
> +       u64 phys_base;
> +       u64 size;
> +       unsigned long bitmap[];
> +};
> +
>  /*
>   * Architecture independent structure for describing a memory map for the
>   * benefit of efi_memmap_init_early(), and for passing context between
> --
> 2.39.3
>
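
For anyone following the splitting logic in process_unaccepted_memory()
above, here is a small userspace model of it. It is only a sketch: struct
model_table, struct split, bitmap_coverage() and split_range() are made-up
names, nothing is actually accepted, and the function merely reports how
many bytes would be accepted immediately and which bits would be set. It
assumes unit_size is a power of two and that the range starts inside the
area the bitmap covers.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8

/* Hypothetical stand-in for the table in the patch; field names mirror it. */
struct model_table {
	uint64_t unit_size;	/* bytes covered by one bit, power of two */
	uint64_t phys_base;	/* physical address represented by bit 0 */
	uint64_t size;		/* bitmap size in bytes */
};

/* Outcome of splitting one unaccepted range. */
struct split {
	uint64_t accepted_now;	/* bytes that would be accepted immediately */
	uint64_t first_bit;	/* first bit that would be set */
	uint64_t nr_bits;	/* number of bits that would be set */
};

/* Bytes of address space the bitmap can describe. */
uint64_t bitmap_coverage(const struct model_table *t)
{
	return t->size * BITS_PER_BYTE * t->unit_size;
}

struct split split_range(const struct model_table *t, uint64_t start,
			 uint64_t end)
{
	struct split s = { 0, 0, 0 };
	uint64_t unit = t->unit_size;
	uint64_t mask = unit - 1;
	uint64_t covered = bitmap_coverage(t);

	assert(unit && !(unit & mask));
	assert(start <= end);

	/* Ranges under 2*unit_size are accepted whole. */
	if (end - start < 2 * unit) {
		s.accepted_now = end - start;
		return s;
	}

	/* Unaligned head: accept up to the next unit boundary. */
	if (start & mask) {
		uint64_t aligned = (start + mask) & ~mask;

		s.accepted_now += aligned - start;
		start = aligned;
	}

	/* Unaligned tail: accept from the previous unit boundary. */
	if (end & mask) {
		uint64_t aligned = end & ~mask;

		s.accepted_now += end - aligned;
		end = aligned;
	}

	/* Anything below phys_base cannot be recorded. */
	if (start < t->phys_base) {
		uint64_t stop = end < t->phys_base ? end : t->phys_base;

		s.accepted_now += stop - start;
		start = t->phys_base;
	}
	if (end < t->phys_base)
		return s;

	/* Translate to offsets from the beginning of the bitmap. */
	start -= t->phys_base;
	end -= t->phys_base;
	assert(start <= covered);	/* assumption of this sketch */

	/* Anything past the end of the bitmap is accepted as well. */
	if (end > covered) {
		s.accepted_now += end - covered;
		end = covered;
	}

	s.first_bit = start / unit;
	s.nr_bits = (end - start) / unit;
	return s;
}
```

With unit_size = 2MiB, phys_base = 0 and a 4096-byte bitmap,
bitmap_coverage() gives 64GiB, matching the cover letter. The 4k..4MiB case
from the comment is accepted whole (no bits set), while 1MiB..9MiB has 1MiB
trimmed at each end and sets 3 bits starting at bit 1.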

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-13 22:04 ` [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory Kirill A. Shutemov
@ 2023-05-16 18:08   ` Ard Biesheuvel
  2023-05-16 18:27     ` Dave Hansen
  2023-05-16 18:33     ` Kirill A. Shutemov
  2023-05-17 16:07   ` Tom Lendacky
  1 sibling, 2 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 18:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On Sun, 14 May 2023 at 00:04, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
>
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.
>

Does this mean that the kernel maps memory before accepting it? As
otherwise, I would assume that such an access would page fault inside
the guest before triggering an exception related to the unaccepted
state.

> There are two parts to fix this issue and comprehensively avoid access
> to unaccepted memory. Together these ensure that an extra "guard" page
> is accepted in addition to the memory that needs to be used.
>
> 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
>    checks up to end+unit_size if 'end' is aligned on a unit_size
>    boundary.
> 2. Implicitly extend accept_memory(start, end) to end+unit_size if 'end'
>    is aligned on a unit_size boundary.
>
> Side note: This leads to something strange. Pages which were accepted
>            at boot, marked by the firmware as accepted and will never
>            _need_ to be accepted might be on the unaccepted_pages list.
>            This is a cue to ensure that the next page is accepted
>            before 'page' can be used.
>
> This is an actual, real-world problem which was discovered during TDX
> testing.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>  drivers/firmware/efi/unaccepted_memory.c | 35 ++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
>
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> index bb91c41f76fb..3d1ca60916dd 100644
> --- a/drivers/firmware/efi/unaccepted_memory.c
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -37,6 +37,34 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>         start -= unaccepted->phys_base;
>         end -= unaccepted->phys_base;
>
> +       /*
> +        * load_unaligned_zeropad() can lead to unwanted loads across page
> +        * boundaries. The unwanted loads are typically harmless. But, they
> +        * might be made to totally unrelated or even unmapped memory.
> +        * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> +        * #VE) to recover from these unwanted loads.
> +        *
> +        * But, this approach does not work for unaccepted memory. For TDX, a
> +        * load from unaccepted memory will not lead to a recoverable exception
> +        * within the guest. The guest will exit to the VMM where the only
> +        * recourse is to terminate the guest.
> +        *
> +        * There are two parts to fix this issue and comprehensively avoid
> +        * access to unaccepted memory. Together these ensure that an extra
> +        * "guard" page is accepted in addition to the memory that needs to be
> +        * used:
> +        *
> +        * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> +        *    checks up to end+unit_size if 'end' is aligned on a unit_size
> +        *    boundary.
> +        *
> +        * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
> +        *    'end' is aligned on a unit_size boundary. (immediately following
> +        *    this comment)
> +        */
> +       if (!(end % unit_size))
> +               end += unit_size;
> +
>         /* Make sure not to overrun the bitmap */
>         if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
>                 end = unaccepted->size * unit_size * BITS_PER_BYTE;
> @@ -84,6 +112,13 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
>         start -= unaccepted->phys_base;
>         end -= unaccepted->phys_base;
>
> +       /*
> +        * Also consider the unaccepted state of the *next* page. See fix #1 in
> +        * the comment on load_unaligned_zeropad() in accept_memory().
> +        */
> +       if (!(end % unit_size))
> +               end += unit_size;
> +
>         /* Make sure not to overrun the bitmap */
>         if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
>                 end = unaccepted->size * unit_size * BITS_PER_BYTE;
> --
> 2.39.3
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 18:08   ` Ard Biesheuvel
@ 2023-05-16 18:27     ` Dave Hansen
  2023-05-16 18:35       ` Ard Biesheuvel
  2023-05-16 18:33     ` Kirill A. Shutemov
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2023-05-16 18:27 UTC (permalink / raw)
  To: Ard Biesheuvel, Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On 5/16/23 11:08, Ard Biesheuvel wrote:
>> But, this approach does not work for unaccepted memory. For TDX, a load
>> from unaccepted memory will not lead to a recoverable exception within
>> the guest. The guest will exit to the VMM where the only recourse is to
>> terminate the guest.
>>
> Does this mean that the kernel maps memory before accepting it? As
> otherwise, I would assume that such an access would page fault inside
> the guest before triggering an exception related to the unaccepted
> state.

Yes, the kernel maps memory before accepting it (modulo things like
DEBUG_PAGEALLOC).


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 18:08   ` Ard Biesheuvel
  2023-05-16 18:27     ` Dave Hansen
@ 2023-05-16 18:33     ` Kirill A. Shutemov
  2023-05-16 23:04       ` Dave Hansen
  1 sibling, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-16 18:33 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Borislav Petkov, Andy Lutomirski, Dave Hansen,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On Tue, May 16, 2023 at 08:08:37PM +0200, Ard Biesheuvel wrote:
> On Sun, 14 May 2023 at 00:04, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> > The unwanted loads are typically harmless. But, they might be made to
> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> > unwanted loads.
> >
> > But, this approach does not work for unaccepted memory. For TDX, a load
> > from unaccepted memory will not lead to a recoverable exception within
> > the guest. The guest will exit to the VMM where the only recourse is to
> > terminate the guest.
> >
> 
> Does this mean that the kernel maps memory before accepting it? As
> otherwise, I would assume that such an access would page fault inside
> the guest before triggering an exception related to the unaccepted
> state.

Yes, the kernel maps all memory into the direct mapping whether it is
accepted or not [yet].

The problem is that an access to unaccepted memory is not a page fault on
TDX. It causes an unrecoverable exit to the host, so it must not happen on
legitimate accesses, including load_unaligned_zeropad() overshoot.

For context: there's a way to configure the TDX environment to trigger #VE
on such accesses, and it is the default. But Linux requires such #VEs to be
disabled, as they open an attack vector from the host to the guest: the
host can pull any private page from under the kernel at any point and
trigger such a #VE. If it happens at just the right time, in the syscall
gap or NMI entry code, it can be exploitable.

See also commits 9a22bf6debbf and 373e715e31bf.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 18:27     ` Dave Hansen
@ 2023-05-16 18:35       ` Ard Biesheuvel
  2023-05-16 19:15         ` Kirill A. Shutemov
  2023-05-16 20:03         ` Dave Hansen
  0 siblings, 2 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 18:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On Tue, 16 May 2023 at 20:27, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/16/23 11:08, Ard Biesheuvel wrote:
> >> But, this approach does not work for unaccepted memory. For TDX, a load
> >> from unaccepted memory will not lead to a recoverable exception within
> >> the guest. The guest will exit to the VMM where the only recourse is to
> >> terminate the guest.
> >>
> > Does this mean that the kernel maps memory before accepting it? As
> > otherwise, I would assume that such an access would page fault inside
> > the guest before triggering an exception related to the unaccepted
> > state.
>
> Yes, the kernel maps memory before accepting it (modulo things like
> DEBUG_PAGEALLOC).
>

OK, and so the architecture stipulates that prefetching or other
speculative accesses must never deliver exceptions to the host
regarding such ranges?

If this all works as it should, then I'm ok with leaving this here,
but I imagine we may want to factor out some arch specific policy here
in the future, as I don't think this would work the same on ARM.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 18:35       ` Ard Biesheuvel
@ 2023-05-16 19:15         ` Kirill A. Shutemov
  2023-05-16 20:03         ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-16 19:15 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Dave Hansen, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On Tue, May 16, 2023 at 08:35:27PM +0200, Ard Biesheuvel wrote:
> On Tue, 16 May 2023 at 20:27, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 5/16/23 11:08, Ard Biesheuvel wrote:
> > >> But, this approach does not work for unaccepted memory. For TDX, a load
> > >> from unaccepted memory will not lead to a recoverable exception within
> > >> the guest. The guest will exit to the VMM where the only recourse is to
> > >> terminate the guest.
> > >>
> > > Does this mean that the kernel maps memory before accepting it? As
> > > otherwise, I would assume that such an access would page fault inside
> > > the guest before triggering an exception related to the unaccepted
> > > state.
> >
> > Yes, the kernel maps memory before accepting it (modulo things like
> > DEBUG_PAGEALLOC).
> >
> 
> OK, and so the architecture stipulates that prefetching or other
> speculative accesses must never deliver exceptions to the host
> regarding such ranges?
> 
> If this all works as it should, then I'm ok with leaving this here,
> but I imagine we may want to factor out some arch specific policy here
> in the future, as I don't think this would work the same on ARM.

Even if other architectures don't need this, it is harmless: we just
accept one unit ahead of time.
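
To make the "one unit ahead" point concrete, here is a minimal userspace
sketch of the 'end' adjustment from this patch. guard_extend() and
units_end() are hypothetical names; 'covered' stands for
size * unit_size * BITS_PER_BYTE, and all offsets are relative to phys_base.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the 'end' fixup: if 'end' sits exactly on a unit boundary, pull
 * in the following unit as a guard, then clamp to what the bitmap covers.
 */
uint64_t guard_extend(uint64_t end, uint64_t unit_size, uint64_t covered)
{
	assert(unit_size != 0);

	if (!(end % unit_size))
		end += unit_size;
	if (end > covered)
		end = covered;
	return end;
}

/* Exclusive index of the last unit the bitmap walk looks at (DIV_ROUND_UP). */
uint64_t units_end(uint64_t end, uint64_t unit_size)
{
	assert(unit_size != 0);

	return (end + unit_size - 1) / unit_size;
}
```

With 2MiB units, an 'end' of exactly 4MiB becomes 6MiB, so units 0..2 are
walked. An 'end' of 4MiB+4k is left alone, but the round-up already includes
unit 2. Either way, the unit that a stray load just past 'end' would touch
has been accepted.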

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 1/9] mm: Add support for unaccepted memory
  2023-05-13 22:04 ` [PATCHv11 1/9] mm: Add " Kirill A. Shutemov
@ 2023-05-16 19:44   ` Tom Lendacky
  2023-05-16 21:32     ` Kirill A. Shutemov
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Lendacky @ 2023-05-16 19:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Mike Rapoport

On 5/13/23 17:04, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways kernel can deal with unaccepted memory:
> 
>   1. Accept all the memory during the boot. It is easy to implement and
>      it doesn't have runtime cost once the system is booted. The downside
>      is very long boot time.
> 
>      Acceptance can be parallelized across multiple CPUs to keep it
>      manageable (e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
>      saturate memory bandwidth and does not scale beyond a certain point.
> 
>   2. Accept a block of memory on the first use. It requires more
>      infrastructure and changes in page allocator to make it work, but
>      it provides good boot time.
> 
>      On-demand memory accept means latency spikes every time kernel steps
>      onto a new memory block. The spikes will go away once workload data
>      set size gets stabilized or all memory gets accepted.
> 
>   3. Accept all memory in background. Introduce a thread (or multiple)
>      that gets memory accepted proactively. It will minimize the time the
>      system experiences latency spikes on memory allocation while keeping
>      boot time low.
> 
>      This approach cannot function on its own. It is an extension of #2:
>      background memory acceptance requires functional scheduler, but the
>      page allocator may need to tap into unaccepted memory before that.
> 
>      The downside of the approach is that these threads also steal CPU
>      cycles and memory bandwidth from the user's workload and may hurt
>      user experience.
> 
> The patch implements #1 and #2 for now. #2 is the default. Some
> workloads may want to use #1 with accept_memory=eager in kernel
> command line. #3 can be implemented later based on user's demands.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>    - memblock has to accept memory on allocation;
> 
>    - page allocator has to accept memory on the first allocation of the
>      page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages. New memory gets accepted
> before putting pages on free lists. It is done lazily: only accept new
> pages when we run out of already accepted memory. The memory gets
> accepted until the high watermark is reached.
> 
> EFI code will provide two helpers if the platform supports unaccepted
> memory:
> 
>   - accept_memory() makes a range of physical addresses accepted.
> 
>   - range_contains_unaccepted_memory() checks whether anything within the
>     range of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>   drivers/base/node.c    |   7 ++
>   fs/proc/meminfo.c      |   5 ++
>   include/linux/mm.h     |  19 +++++
>   include/linux/mmzone.h |   8 ++
>   mm/internal.h          |   1 +
>   mm/memblock.c          |   9 +++
>   mm/mm_init.c           |   7 ++
>   mm/page_alloc.c        | 173 +++++++++++++++++++++++++++++++++++++++++
>   mm/vmstat.c            |   3 +
>   9 files changed, 232 insertions(+)
> 

> diff --git a/mm/internal.h b/mm/internal.h
> index 68410c6d97ac..b1db7ba5f57d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1099,4 +1099,5 @@ struct vma_prepare {
>   	struct vm_area_struct *remove;
>   	struct vm_area_struct *remove2;
>   };
> +

Looks like an unintentional change.

>   #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 3feafea06ab2..50b921119600 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1436,6 +1436,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>   		 */
>   		kmemleak_alloc_phys(found, size, 0);
>   
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);

I'm not an mm or memblock expert, but do we need to worry about freed 
memory from memblock_phys_free() being possibly doubly accepted? A double 
acceptance will trigger a guest termination on SNP.

Thanks,
Tom

> +
>   	return found;
>   }
>   
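
On the double-acceptance question: going only by accept_memory() as posted
in this series, a unit is handed to arch_accept_memory() only while its bit
is set in the bitmap, and the bit is cleared right afterwards. A toy model
of that behaviour follows. The names are hypothetical, the bitmap is fixed
at 64 units, and it counts one platform call per unit rather than one per
contiguous bit range as the patch does.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_UNITS 64

/* Hypothetical model state: one bit per unit, a set bit means "unaccepted". */
uint64_t model_bitmap = ~0ULL;
unsigned int model_arch_calls;	/* units handed to the platform so far */

/* Accept units [first, last); returns how many were newly accepted. */
unsigned int model_accept_units(unsigned int first, unsigned int last)
{
	unsigned int newly = 0;

	assert(first <= last);

	for (unsigned int i = first; i < last && i < MODEL_UNITS; i++) {
		uint64_t bit = 1ULL << i;

		if (!(model_bitmap & bit))
			continue;	/* already accepted: skipped */

		model_arch_calls++;	/* stands in for arch_accept_memory() */
		model_bitmap &= ~bit;
		newly++;
	}
	return newly;
}
```

Accepting units 2..4 twice reaches the platform three times, not six, and a
later overlapping request only touches the units that are still marked.
This models only ranges tracked by the bitmap; it says nothing about memory
that gets accepted through some other path.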

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 2/9] efi/x86: Get full memory map in allocate_e820()
  2023-05-13 22:04 ` [PATCHv11 2/9] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
@ 2023-05-16 19:52   ` Tom Lendacky
  0 siblings, 0 replies; 38+ messages in thread
From: Tom Lendacky @ 2023-05-16 19:52 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Borislav Petkov

On 5/13/23 17:04, Kirill A. Shutemov wrote:
> Currently allocate_e820() is only interested in the size of map and size
> of memory descriptor to determine how many e820 entries the kernel
> needs.
> 
> UEFI Specification version 2.9 introduces a new memory type --
> unaccepted memory. To track unaccepted memory kernel needs to allocate
> a bitmap. The size of the bitmap is dependent on the maximum physical
> address present in the system. A full memory map is required to find
> the maximum address.
> 
> Modify allocate_e820() to get a full memory map.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Acked-by: Ard Biesheuvel <ardb@kernel.org>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>   drivers/firmware/efi/libstub/x86-stub.c | 26 +++++++++++--------------
>   1 file changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index a0bfd31358ba..fff81843169c 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -681,28 +681,24 @@ static efi_status_t allocate_e820(struct boot_params *params,
>   				  struct setup_data **e820ext,
>   				  u32 *e820ext_size)
>   {
> -	unsigned long map_size, desc_size, map_key;
> +	struct efi_boot_memmap *map;
>   	efi_status_t status;
> -	__u32 nr_desc, desc_version;
> +	__u32 nr_desc;
>   
> -	/* Only need the size of the mem map and size of each mem descriptor */
> -	map_size = 0;
> -	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
> -			     &desc_size, &desc_version);
> -	if (status != EFI_BUFFER_TOO_SMALL)
> -		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
> +	status = efi_get_memory_map(&map, false);
> +	if (status != EFI_SUCCESS)
> +		return status;
>   
> -	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
> -
> -	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
> -		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
> +	nr_desc = map->map_size / map->desc_size;
> +	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
> +		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
> +			EFI_MMAP_NR_SLACK_SLOTS;
>   
>   		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
> -		if (status != EFI_SUCCESS)
> -			return status;
>   	}
>   
> -	return EFI_SUCCESS;
> +	efi_bs_call(free_pool, map);
> +	return status;
>   }
>   
>   struct exit_boot_struct {
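
The reworked sizing check in allocate_e820() is easy to get wrong by one,
so here is the arithmetic in isolation. This is a sketch:
extra_e820_entries() is a made-up helper, and the values 128 (e820_table
slots in boot_params) and 8 (EFI_MMAP_NR_SLACK_SLOTS) used in the note
below are assumptions about the tree, not something stated in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of extended e820 entries needed when 'slack' slots of the fixed
 * table are held back for memory map growth.
 */
uint32_t extra_e820_entries(uint32_t nr_desc, uint32_t table_slots,
			    uint32_t slack)
{
	uint32_t usable;

	assert(table_slots >= slack);
	usable = table_slots - slack;

	if (nr_desc > usable)
		return nr_desc - usable;	/* nr_desc - table_slots + slack */
	return 0;
}
```

With those assumed values, up to 120 descriptors fit in boot_params, 121
need one extended entry, and 128 need eight.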

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 18:35       ` Ard Biesheuvel
  2023-05-16 19:15         ` Kirill A. Shutemov
@ 2023-05-16 20:03         ` Dave Hansen
  2023-05-16 21:52           ` Kirill A. Shutemov
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2023-05-16 20:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On 5/16/23 11:35, Ard Biesheuvel wrote:
>>> Does this mean that the kernel maps memory before accepting it? As
>>> otherwise, I would assume that such an access would page fault inside
>>> the guest before triggering an exception related to the unaccepted
>>> state.
>> Yes, the kernel maps memory before accepting it (modulo things like
>> DEBUG_PAGEALLOC).
>>
> OK, and so the architecture stipulates that prefetching or other
> speculative accesses must never deliver exceptions to the host
> regarding such ranges?

I don't know of anywhere that this is explicitly written.  It's probably
implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
if I know where it is. :)

If this is something anyone wants to see added to the SEPT_VE_DISABLE
documentation, please speak up.  I don't think it would be hard to get
it added and provide an explicit guarantee.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 1/9] mm: Add support for unaccepted memory
  2023-05-16 19:44   ` Tom Lendacky
@ 2023-05-16 21:32     ` Kirill A. Shutemov
  0 siblings, 0 replies; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-16 21:32 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel, Mike Rapoport

On Tue, May 16, 2023 at 02:44:00PM -0500, Tom Lendacky wrote:
> On 5/13/23 17:04, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, require memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> > 
> > There are several ways kernel can deal with unaccepted memory:
> > 
> >   1. Accept all the memory during the boot. It is easy to implement and
> >      it doesn't have runtime cost once the system is booted. The downside
> >      is very long boot time.
> > 
> >      Accept can be parallelized to multiple CPUs to keep it manageable
> >      (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
> >      memory bandwidth and does not scale beyond the point.
> > 
> >   2. Accept a block of memory on the first use. It requires more
> >      infrastructure and changes in page allocator to make it work, but
> >      it provides good boot time.
> > 
> >      On-demand memory accept means latency spikes every time kernel steps
> >      onto a new memory block. The spikes will go away once workload data
> >      set size gets stabilized or all memory gets accepted.
> > 
> >   3. Accept all memory in background. Introduce a thread (or multiple)
> >      that gets memory accepted proactively. It will minimize time the
> >      system experience latency spikes on memory allocation while keeping
> >      low boot time.
> > 
> >      This approach cannot function on its own. It is an extension of #2:
> >      background memory acceptance requires functional scheduler, but the
> >      page allocator may need to tap into unaccepted memory before that.
> > 
> >      The downside of the approach is that these threads also steal CPU
> >      cycles and memory bandwidth from the user's workload and may hurt
> >      user experience.
> > 
> > The patch implements #1 and #2 for now. #2 is the default. Some
> > workloads may want to use #1 with accept_memory=eager in kernel
> > command line. #3 can be implemented later based on user's demands.
> > 
> > Support of unaccepted memory requires a few changes in core-mm code:
> > 
> >    - memblock has to accept memory on allocation;
> > 
> >    - page allocator has to accept memory on the first allocation of the
> >      page;
> > 
> > Memblock change is trivial.
> > 
> > The page allocator is modified to accept pages. New memory gets accepted
> > before putting pages on free lists. It is done lazily: only accept new
> > pages when we run out of already accepted memory. The memory gets
> > accepted until the high watermark is reached.
> > 
> > EFI code will provide two helpers if the platform supports unaccepted
> > memory:
> > 
> >   - accept_memory() makes a range of physical addresses accepted.
> > 
> >   - range_contains_unaccepted_memory() checks anything within the range
> >     of physical addresses requires acceptance.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >   drivers/base/node.c    |   7 ++
> >   fs/proc/meminfo.c      |   5 ++
> >   include/linux/mm.h     |  19 +++++
> >   include/linux/mmzone.h |   8 ++
> >   mm/internal.h          |   1 +
> >   mm/memblock.c          |   9 +++
> >   mm/mm_init.c           |   7 ++
> >   mm/page_alloc.c        | 173 +++++++++++++++++++++++++++++++++++++++++
> >   mm/vmstat.c            |   3 +
> >   9 files changed, 232 insertions(+)
> > 
> 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 68410c6d97ac..b1db7ba5f57d 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1099,4 +1099,5 @@ struct vma_prepare {
> >   	struct vm_area_struct *remove;
> >   	struct vm_area_struct *remove2;
> >   };
> > +
> 
> Looks like an unintentional change.

Yep, will fix.

> >   #endif	/* __MM_INTERNAL_H */
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 3feafea06ab2..50b921119600 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1436,6 +1436,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >   		 */
> >   		kmemleak_alloc_phys(found, size, 0);
> > +	/*
> > +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > +	 * require memory to be accepted before it can be used by the
> > +	 * guest.
> > +	 *
> > +	 * Accept the memory of the allocated buffer.
> > +	 */
> > +	accept_memory(found, found + size);
> 
> I'm not an mm or memblock expert, but do we need to worry about freed memory
> from memblock_phys_free() being possibly doubly accepted? A double
> acceptance will trigger a guest termination on SNP.

There will be no double acceptance. accept_memory() will consult the
bitmap before accepting any memory. For already accepted memory it is a
nop.
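
To illustrate why a second call is harmless, here is a rough sketch of the
check-then-clear pattern. It is not the code from the patchset: all
sketch_* names, the fixed 2MiB unit and the tiny static bitmap are made up
for the example, the platform hook is a stand-in, and locking is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only; the real unit size and bitmap come from firmware. */
#define SKETCH_UNIT_SIZE (2ULL << 20)	/* one bit covers 2MiB */
#define SKETCH_NR_UNITS  128ULL		/* bitmap covers the first 256MiB */

/* A set bit means "this unit still needs to be accepted". */
static uint64_t sketch_bitmap[SKETCH_NR_UNITS / 64];

/* Counts how often the (hypothetical) platform hook was invoked. */
static unsigned int sketch_platform_calls;

/* Stand-in for the TDX/SNP-specific acceptance call. */
static void sketch_platform_accept(uint64_t start, uint64_t end)
{
	assert(start < end);
	sketch_platform_calls++;
}

static void sketch_accept_memory(uint64_t start, uint64_t end)
{
	uint64_t first = start / SKETCH_UNIT_SIZE;
	uint64_t last = (end + SKETCH_UNIT_SIZE - 1) / SKETCH_UNIT_SIZE;
	uint64_t unit;

	/* Memory beyond the bitmap is treated as already accepted. */
	if (last > SKETCH_NR_UNITS)
		last = SKETCH_NR_UNITS;

	for (unit = first; unit < last; unit++) {
		uint64_t mask = 1ULL << (unit % 64);
		uint64_t *word = &sketch_bitmap[unit / 64];

		if (!(*word & mask))
			continue;	/* already accepted: nothing to do */

		sketch_platform_accept(unit * SKETCH_UNIT_SIZE,
				       (unit + 1) * SKETCH_UNIT_SIZE);
		*word &= ~mask;
	}
}
```

Calling sketch_accept_memory() twice on the same range reaches the platform
hook only on the first call, so memory that memblock frees and later hands
out again would not be accepted a second time.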

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 20:03         ` Dave Hansen
@ 2023-05-16 21:52           ` Kirill A. Shutemov
  2023-05-16 21:59             ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-16 21:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ard Biesheuvel, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Dave Hansen

On Tue, May 16, 2023 at 01:03:32PM -0700, Dave Hansen wrote:
> On 5/16/23 11:35, Ard Biesheuvel wrote:
> >>> Does this mean that the kernel maps memory before accepting it? As
> >>> otherwise, I would assume that such an access would page fault inside
> >>> the guest before triggering an exception related to the unaccepted
> >>> state.
> >> Yes, the kernel maps memory before accepting it (modulo things like
> >> DEBUG_PAGEALLOC).
> >>
> > OK, and so the architecture stipulates that prefetching or other
> > speculative accesses must never deliver exceptions to the host
> > regarding such ranges?
> 
> I don't know of anywhere that this is explicitly written.  It's probably
> implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
> if I know where it is. :)

It is not specific to TDX: on x86 (and all architectures with precise
exceptions) exception handling is delayed until instruction retirement and
will not happen if speculation turned out to be wrong. And prefetching
never generates exceptions.

But I failed to find it right away in the 5000+ pages of the Intel Software
Developer’s Manual. :/

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 21:52           ` Kirill A. Shutemov
@ 2023-05-16 21:59             ` Dave Hansen
  2023-05-16 22:15               ` Ard Biesheuvel
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2023-05-16 21:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Dave Hansen

On 5/16/23 14:52, Kirill A. Shutemov wrote:
> On Tue, May 16, 2023 at 01:03:32PM -0700, Dave Hansen wrote:
>> On 5/16/23 11:35, Ard Biesheuvel wrote:
>>>>> Does this mean that the kernel maps memory before accepting it? As
>>>>> otherwise, I would assume that such an access would page fault inside
>>>>> the guest before triggering an exception related to the unaccepted
>>>>> state.
>>>> Yes, the kernel maps memory before accepting it (modulo things like
>>>> DEBUG_PAGEALLOC).
>>>>
>>> OK, and so the architecture stipulates that prefetching or other
>>> speculative accesses must never deliver exceptions to the host
>>> regarding such ranges?
>> I don't know of anywhere that this is explicitly written.  It's probably
>> implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
>> if I know where it is. 😄
> It is not specific to TDX: on x86 (and all architectures with precise
> exceptions) exception handling is delayed until instruction retirement and
> will not happen if speculation turned out to be wrong. And prefetching
> never generates exceptions.

Not to be Debbie Downer too much here, but it's *totally* possible for
speculative execution to go read memory that causes you to machine
check.  We've had such bugs in Linux.

We just happen to be lucky in this case that the unaccepted memory
exceptions don't generate machine checks *AND* TDX hardware does not
machine check on speculative accesses that would _just_ violate TDX
security properties.

You're right for normal, sane exceptions, though.

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 21:59             ` Dave Hansen
@ 2023-05-16 22:15               ` Ard Biesheuvel
  0 siblings, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 22:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Dario Faggioli,
	Mike Rapoport, David Hildenbrand, Mel Gorman, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, aarcange, peterx, x86,
	linux-mm, linux-coco, linux-efi, linux-kernel, Dave Hansen

On Wed, 17 May 2023 at 00:00, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/16/23 14:52, Kirill A. Shutemov wrote:
> > On Tue, May 16, 2023 at 01:03:32PM -0700, Dave Hansen wrote:
> >> On 5/16/23 11:35, Ard Biesheuvel wrote:
> >>>>> Does this mean that the kernel maps memory before accepting it? As
> >>>>> otherwise, I would assume that such an access would page fault inside
> >>>>> the guest before triggering an exception related to the unaccepted
> >>>>> state.
> >>>> Yes, the kernel maps memory before accepting it (modulo things like
> >>>> DEBUG_PAGEALLOC).
> >>>>
> >>> OK, and so the architecture stipulates that prefetching or other
> >>> speculative accesses must never deliver exceptions to the host
> >>> regarding such ranges?
> >> I don't know of anywhere that this is explicitly written.  It's probably
> >> implicit _somewhere_ in the reams of VMX/TDX and base SDM docs, but heck
> >> if I know where it is. 😄
> > It is not specific to TDX: on x86 (and all architectures with precise
> > exceptions) exception handling is delayed until instruction retirement and
> > will not happen if speculation turned out to be wrong. And prefetching
> > never generates exceptions.
>
> Not to be Debbie Downer too much here, but it's *totally* possible for
> speculative execution to go read memory that causes you to machine
> check.  We've had such bugs in Linux.
>
> We just happen to be lucky in this case that the unaccepted memory
> exceptions don't generate machine checks *AND* TDX hardware does not
> machine check on speculative accesses that would _just_ violate TDX
> security properties.
>
> You're right for normal, sane exceptions, though.

Same thing on ARM, although I'd have to check their RME stuff in more
detail to see how it behaves in this particular case.

But Kirill is right that it doesn't really matter for the logic in
this patch - it just accepts some additional pages. The relevant
difference between implementations will likely be whether unaccepted
memory gets mapped beforehand in the first place, but we'll deal with
that once we have to.

As long as we only accept memory that appears in the bitmap as
'unaccepted', this kind of rounding seems safe and reasonable to me.
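
For reference, the rounding could look roughly like the sketch below. The
guard condition (pull in one more unit when the end of the range sits on a
unit boundary) is my assumption about the shape of the logic, not a quote
from the patch, and the sketch_* names and the fixed 2MiB unit are made up
for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNIT_SIZE (2ULL << 20)	/* assumed: one bitmap bit per 2MiB */

struct sketch_unit_range {
	uint64_t first;	/* first unit to check in the bitmap */
	uint64_t last;	/* one past the last unit to check */
};

static struct sketch_unit_range sketch_units_to_check(uint64_t start,
						      uint64_t end)
{
	struct sketch_unit_range r;

	/*
	 * Assumed guard: a load that runs past 'end' could touch the next
	 * unit, so include that unit when 'end' is unit-aligned.
	 */
	if (!(end % SKETCH_UNIT_SIZE))
		end += SKETCH_UNIT_SIZE;

	r.first = start / SKETCH_UNIT_SIZE;
	r.last = (end + SKETCH_UNIT_SIZE - 1) / SKETCH_UNIT_SIZE;
	return r;
}
```

The extra unit only widens the window of bits that get looked at. Whether
anything is actually accepted still depends on the bit being set in the
bitmap.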

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>

* Re: [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory
  2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2023-05-13 22:04 ` [PATCHv11 9/9] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
@ 2023-05-16 22:41 ` Tom Lendacky
  2023-05-16 23:22   ` Kirill A. Shutemov
  9 siblings, 1 reply; 38+ messages in thread
From: Tom Lendacky @ 2023-05-16 22:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 5/13/23 17:04, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, requiring memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific for the Virtual
> Machine platform.
> 
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
> 
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
> 
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 has to be modified on every page acceptance. It leads
> to table fragmentation, but there's a limited number of entries in the
> e820 table
> 
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents
> 2MiB in the address space: one 4k page is enough to track 64GiB or
> physical address space.
> 
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- It needs 256MiB to handle 4PiB of the address
> space.
> 
> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> 
> The approach lowers boot time substantially. Boot to shell is ~2.5x
> faster for 4G TDX VM and ~4x faster for 64G.
> 
> TDX-specific code isolated from the core of unaccepted memory support. It
> supposed to help to plug-in different implementation of unaccepted memory
> such as SEV-SNP.
> 
> -- Fragmentation study --
> 
> Vlastimil and Mel were concern about effect of unaccepted memory on
> fragmentation prevention measures in page allocator. I tried to evaluate
> it, but it is tricky. As suggested I tried to run multiple parallel kernel
> builds and follow how often kmem:mm_page_alloc_extfrag gets hit.
> 
> See results in the v9 of the patchset[1][2]
> 
> [1] https://lore.kernel.org/all/20230330114956.20342-1-kirill.shutemov@linux.intel.com
> [2] https://lore.kernel.org/all/20230416191940.ex7ao43pmrjhru2p@box.shutemov.name
> 
> --
> 
> The tree can be found here:
> 
> https://github.com/intel/tdx.git guest-unaccepted-memory

I get some failures when building without TDX support selected in my
kernel config after adding unaccepted memory support for SNP:

   In file included from arch/x86/boot/compressed/../../coco/tdx/tdx-shared.c:1,
                    from arch/x86/boot/compressed/tdx-shared.c:2:
   ./arch/x86/include/asm/tdx.h: In function ‘tdx_kvm_hypercall’:
   ./arch/x86/include/asm/tdx.h:72:17: error: ‘ENODEV’ undeclared (first use in this function)
      72 |         return -ENODEV;
         |                 ^~~~~~
   ./arch/x86/include/asm/tdx.h:72:17: note: each undeclared identifier is reported only once for each function it appears in

Adding an include for linux/errno.h gets past that error, but then
I get the following:

   ld: arch/x86/boot/compressed/tdx-shared.o: in function `tdx_enc_status_changed_phys':
   tdx-shared.c:(.text+0x42): undefined reference to `__tdx_hypercall'
   ld: tdx-shared.c:(.text+0x7f): undefined reference to `__tdx_module_call'
   ld: tdx-shared.c:(.text+0xce): undefined reference to `__tdx_module_call'
   ld: tdx-shared.c:(.text+0x13b): undefined reference to `__tdx_module_call'
   ld: tdx-shared.c:(.text+0x153): undefined reference to `cc_mkdec'
   ld: tdx-shared.c:(.text+0x15d): undefined reference to `cc_mkdec'
   ld: tdx-shared.c:(.text+0x18e): undefined reference to `__tdx_hypercall'
   ld: arch/x86/boot/compressed/vmlinux: hidden symbol `__tdx_hypercall' isn't defined
   ld: final link failed: bad value

So it looks like arch/x86/boot/compressed/tdx-shared.c is being
built, while arch/x86/boot/compressed/tdx.c isn't.

After setting TDX in the kernel config, I can build successfully, but
I'm running into an error when trying to accept memory during
decompression.

In drivers/firmware/efi/libstub/unaccepted_memory.c, I can see that the
unaccepted_table is allocated, but when accept_memory() is invoked the
table address is now zero. I thought maybe it had to do with bss, but even
putting it in the .data section didn't help. I'll keep digging, but if you
have any ideas, that would be great.

Thanks,
Tom

> 
> The patchset depends on MAX_ORDER changes in MM tree.
> 
> v11:
>   - Restructure the code to make it less x86-specific (suggested by Ard):
>     + use EFI configuration table instead of zero-page to pass down bitmap;
>     + do not imply 1bit == 2M in bitmap;
>     + move bulk of the code under driver/firmware/efi;
>   - The bitmap only covers unaccpeted memory now. All memory that is not covered
>     by the bitmap assumed accepted;
>   - Reviewed-by from Ard;
> v10:
>   - Restructure code around zones_with_unaccepted_pages static brach to avoid
>     unnecessary function calls (Suggested by Vlastimil);
>   - Drop mentions of PageUnaccepted();
>   - Drop patches that add fake unaccepted memory support and sysfs handle to
>     accept memory manually;
>   - Add Reviewed-by from Vlastimil;
> v9:
>   - Accept memory up to high watermark when kernel runs out of free memory;
>   - Treat unaccepted memory as unusable in __zone_watermark_unusable_free();
>   - Per-zone unaccepted memory accounting;
>   - All pages on unaccepted list are MAX_ORDER now;
>   - accept_memory=eager in cmdline to pre-accept memory during the boot;
>   - Implement fake unaccepted memory;
>   - Sysfs handle to accept memory manually;
>   - Drop PageUnaccepted();
>   - Rename unaccepted_pages static key to zones_with_unaccepted_pages;
> v8:
>   - Rewrite core-mm support for unaccepted memory (patch 02/14);
>   - s/UnacceptedPages/Unaccepted/ in meminfo;
>   - Drop arch/x86/boot/compressed/compiler.h;
>   - Fix build errors;
>   - Adjust commit messages and comments;
>   - Reviewed-bys from Dave and Borislav;
>   - Rebased to tip/master.
> v7:
>   - Rework meminfo counter to use PageUnaccepted() and move to generic code;
>   - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
>   - Add Reviewed-by from David;
> v6:
>   - Fix load_unaligned_zeropad() on machine with unaccepted memory;
>   - Clear PageUnaccepted() on merged pages, leaving it only on head;
>   - Clarify error handling in allocate_e820();
>   - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
>   - Disable kexec at boottime instead of build conflict;
>   - Rebased to tip/master;
>   - Spelling fixes;
>   - Add Reviewed-by from Mike and David;
> v5:
>   - Updates comments and commit messages;
>     + Explain options for unaccepted memory handling;
>   - Expose amount of unaccepted memory in /proc/meminfo
>   - Adjust check in page_expected_state();
>   - Fix error code handling in allocate_e820();
>   - Centralize __pa()/__va() definitions in the boot stub;
>   - Avoid includes from the main kernel in the boot stub;
>   - Use an existing hole in boot_param for unaccepted_memory, instead of adding
>     to the end of the structure;
>   - Extract allocate_unaccepted_memory() form allocate_e820();
>   - Complain if there's unaccepted memory, but kernel does not support it;
>   - Fix vmstat counter;
>   - Split up few preparatory patches;
>   - Random readability adjustments;
> v4:
>   - PageBuddyUnaccepted() -> PageUnaccepted;
>   - Use separate page_type, not shared with offline;
>   - Rework interface between core-mm and arch code;
>   - Adjust commit messages;
>   - Ack from Mike;
> 
> Kirill A. Shutemov (9):
>    mm: Add support for unaccepted memory
>    efi/x86: Get full memory map in allocate_e820()
>    efi/libstub: Implement support for unaccepted memory
>    x86/boot/compressed: Handle unaccepted memory
>    efi: Provide helpers for unaccepted memory
>    efi/unaccepted: Avoid load_unaligned_zeropad() stepping into
>      unaccepted memory
>    x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
>      boot stub
>    x86/tdx: Refactor try_accept_one()
>    x86/tdx: Add unaccepted memory support
> 
>   arch/x86/Kconfig                              |   2 +
>   arch/x86/boot/compressed/Makefile             |   1 +
>   arch/x86/boot/compressed/efi.h                |   1 +
>   arch/x86/boot/compressed/error.c              |  19 ++
>   arch/x86/boot/compressed/error.h              |   1 +
>   arch/x86/boot/compressed/kaslr.c              |  35 ++-
>   arch/x86/boot/compressed/mem.c                |  42 ++++
>   arch/x86/boot/compressed/misc.c               |   6 +
>   arch/x86/boot/compressed/misc.h               |   6 +
>   arch/x86/boot/compressed/tdx-shared.c         |   2 +
>   arch/x86/boot/compressed/tdx.c                |  37 +++
>   arch/x86/coco/tdx/Makefile                    |   2 +-
>   arch/x86/coco/tdx/tdx-shared.c                |  95 ++++++++
>   arch/x86/coco/tdx/tdx.c                       | 118 +---------
>   arch/x86/include/asm/efi.h                    |   2 +
>   arch/x86/include/asm/shared/tdx.h             |  53 +++++
>   arch/x86/include/asm/tdx.h                    |  21 +-
>   arch/x86/include/asm/unaccepted_memory.h      |  23 ++
>   drivers/base/node.c                           |   7 +
>   drivers/firmware/efi/Kconfig                  |  14 ++
>   drivers/firmware/efi/Makefile                 |   1 +
>   drivers/firmware/efi/efi.c                    |   7 +
>   drivers/firmware/efi/libstub/Makefile         |   2 +
>   drivers/firmware/efi/libstub/bitmap.c         |  41 ++++
>   drivers/firmware/efi/libstub/efistub.h        |   6 +
>   drivers/firmware/efi/libstub/find.c           |  43 ++++
>   .../firmware/efi/libstub/unaccepted_memory.c  | 222 ++++++++++++++++++
>   drivers/firmware/efi/libstub/x86-stub.c       |  39 +--
>   drivers/firmware/efi/unaccepted_memory.c      | 138 +++++++++++
>   fs/proc/meminfo.c                             |   5 +
>   include/linux/efi.h                           |  13 +-
>   include/linux/mm.h                            |  19 ++
>   include/linux/mmzone.h                        |   8 +
>   mm/internal.h                                 |   1 +
>   mm/memblock.c                                 |   9 +
>   mm/mm_init.c                                  |   7 +
>   mm/page_alloc.c                               | 173 ++++++++++++++
>   mm/vmstat.c                                   |   3 +
>   38 files changed, 1060 insertions(+), 164 deletions(-)
>   create mode 100644 arch/x86/boot/compressed/mem.c
>   create mode 100644 arch/x86/boot/compressed/tdx-shared.c
>   create mode 100644 arch/x86/coco/tdx/tdx-shared.c
>   create mode 100644 arch/x86/include/asm/unaccepted_memory.h
>   create mode 100644 drivers/firmware/efi/libstub/bitmap.c
>   create mode 100644 drivers/firmware/efi/libstub/find.c
>   create mode 100644 drivers/firmware/efi/libstub/unaccepted_memory.c
>   create mode 100644 drivers/firmware/efi/unaccepted_memory.c
> 

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-16 18:33     ` Kirill A. Shutemov
@ 2023-05-16 23:04       ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2023-05-16 23:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ard Biesheuvel
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On 5/16/23 11:33, Kirill A. Shutemov wrote:
> For context: there's a way configure TDX environment to trigger #VE on
> such accesses and it is default. But Linux requires such #VEs to be
> disabled as it opens attack vector from the host to the guest: host can
> pull any private page from under kernel at any point and trigger such #VE.
> If it happens in just a right time in syscall gap or NMI entry code it can
> be exploitable.

I'm kinda uncomfortable with saying it's exploitable.

It really boils down to not wanting to deal with managing a new IST
exception.  While the NMI IST implementation is about as good as we can
get it, I believe there are still holes in it (even if we consider only
how it interacts with #MC).  The more IST users we add, the more holes
there are.

You add the fact that an actual adversary can induce the exceptions
instead of (rare and mostly random) radiation that causes #MC, and it
makes me want to either curl up in a little ball or pursue a new career.

So, exploitable?  Dunno.  Do I want to touch an #VE/IST implementation?
No way, not with a 10 foot pole.

* Re: [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory
  2023-05-16 22:41 ` [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Tom Lendacky
@ 2023-05-16 23:22   ` Kirill A. Shutemov
  2023-05-17 14:32     ` Tom Lendacky
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-16 23:22 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Tue, May 16, 2023 at 05:41:55PM -0500, Tom Lendacky wrote:
> On 5/13/23 17:04, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific for the Virtual
> > Machine platform.
> > 
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> > 
> > The kernel needs to know what memory has been accepted. Firmware
> > communicates this information via memory map: a new memory type --
> > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > 
> > Range-based tracking works fine for firmware, but it gets bulky for
> > the kernel: e820 has to be modified on every page acceptance. It leads
> > to table fragmentation, but there's a limited number of entries in the
> > e820 table
> > 
> > Another option is to mark such memory as usable in e820 and track if the
> > range has been accepted in a bitmap. One bit in the bitmap represents
> > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > physical address space.
> > 
> > In the worst-case scenario -- a huge hole in the middle of the
> > address space -- It needs 256MiB to handle 4PiB of the address
> > space.
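
The sizing numbers above can be double-checked with a few lines of C. This
is only the arithmetic, assuming a fixed 2MiB per bit; sketch_bitmap_bytes()
is a made-up helper, not something from the patchset.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNIT_SIZE (2ULL << 20)	/* 2MiB of address space per bit */

/* Bytes of bitmap needed to cover 'range' bytes of address space. */
static uint64_t sketch_bitmap_bytes(uint64_t range)
{
	/* Round up to whole units, then to whole bytes. */
	uint64_t bits = (range + SKETCH_UNIT_SIZE - 1) / SKETCH_UNIT_SIZE;

	return (bits + 7) / 8;
}
```

A 4k page holds 32768 bits, and 32768 * 2MiB is 64GiB. Likewise 4PiB / 2MiB
is 2^31 bits, which is 256MiB of bitmap.
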
> > 
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > 
> > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > faster for 4G TDX VM and ~4x faster for 64G.
> > 
> > TDX-specific code isolated from the core of unaccepted memory support. It
> > supposed to help to plug-in different implementation of unaccepted memory
> > such as SEV-SNP.
> > 
> > -- Fragmentation study --
> > 
> > Vlastimil and Mel were concern about effect of unaccepted memory on
> > fragmentation prevention measures in page allocator. I tried to evaluate
> > it, but it is tricky. As suggested I tried to run multiple parallel kernel
> > builds and follow how often kmem:mm_page_alloc_extfrag gets hit.
> > 
> > See results in the v9 of the patchset[1][2]
> > 
> > [1] https://lore.kernel.org/all/20230330114956.20342-1-kirill.shutemov@linux.intel.com
> > [2] https://lore.kernel.org/all/20230416191940.ex7ao43pmrjhru2p@box.shutemov.name
> > 
> > --
> > 
> > The tree can be found here:
> > 
> > https://github.com/intel/tdx.git guest-unaccepted-memory
> 
> I get some failures when building without TDX support selected in my
> kernel config after adding unaccepted memory support for SNP:
> 
>   In file included from arch/x86/boot/compressed/../../coco/tdx/tdx-shared.c:1,
>                    from arch/x86/boot/compressed/tdx-shared.c:2:
>   ./arch/x86/include/asm/tdx.h: In function ‘tdx_kvm_hypercall’:
>   ./arch/x86/include/asm/tdx.h:72:17: error: ‘ENODEV’ undeclared (first use in this function)
>      72 |         return -ENODEV;
>         |                 ^~~~~~
>   ./arch/x86/include/asm/tdx.h:72:17: note: each undeclared identifier is reported only once for each function it appears in
> 
> Adding an include for linux/errno.h gets past that error, but then
> I get the following:
> 
>   ld: arch/x86/boot/compressed/tdx-shared.o: in function `tdx_enc_status_changed_phys':
>   tdx-shared.c:(.text+0x42): undefined reference to `__tdx_hypercall'
>   ld: tdx-shared.c:(.text+0x7f): undefined reference to `__tdx_module_call'
>   ld: tdx-shared.c:(.text+0xce): undefined reference to `__tdx_module_call'
>   ld: tdx-shared.c:(.text+0x13b): undefined reference to `__tdx_module_call'
>   ld: tdx-shared.c:(.text+0x153): undefined reference to `cc_mkdec'
>   ld: tdx-shared.c:(.text+0x15d): undefined reference to `cc_mkdec'
>   ld: tdx-shared.c:(.text+0x18e): undefined reference to `__tdx_hypercall'
>   ld: arch/x86/boot/compressed/vmlinux: hidden symbol `__tdx_hypercall' isn't defined
>   ld: final link failed: bad value
> 
> So it looks like arch/x86/boot/compressed/tdx-shared.c is being
> built, while arch/x86/boot/compressed/tdx.c isn't.

Right. I think this should help:

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 78f67e0a2666..b13a58021086 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -106,8 +106,8 @@ ifdef CONFIG_X86_64
 endif

 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
-vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
-vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o $(obj)/tdx-shared.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o $(obj)/tdx-shared.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o

 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o

> After setting TDX in the kernel config, I can build successfully, but
> I'm running into an error when trying to accept memory during
> decompression.
> 
> In drivers/firmware/efi/libstub/unaccepted_memory.c, I can see that the
> unaccepted_table is allocated, but when accept_memory() is invoked the
> table address is now zero. I thought maybe it had to do with bss, but even
> putting it in the .data section didn't help. I'll keep digging, but if you
> have any ideas, that would be great.

Not right away. But maybe seeing your side of enabling would help.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory
  2023-05-16 23:22   ` Kirill A. Shutemov
@ 2023-05-17 14:32     ` Tom Lendacky
  2023-05-17 18:36       ` Kirill A. Shutemov
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Lendacky @ 2023-05-17 14:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 5/16/23 18:22, Kirill A. Shutemov wrote:
> On Tue, May 16, 2023 at 05:41:55PM -0500, Tom Lendacky wrote:
>> On 5/13/23 17:04, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory
>>> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
>>> SEV-SNP, require memory to be accepted before it can be used by the
>>> guest. Accepting happens via a protocol specific to the Virtual
>>> Machine platform.
>>>
>>> Accepting memory is costly and it makes VMM allocate memory for the
>>> accepted guest physical address range. It's better to postpone memory
>>> acceptance until memory is needed. It lowers boot time and reduces
>>> memory overhead.
>>>
>>> The kernel needs to know what memory has been accepted. Firmware
>>> communicates this information via memory map: a new memory type --
>>> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>>>
>>> Range-based tracking works fine for firmware, but it gets bulky for
>>> the kernel: e820 has to be modified on every page acceptance. It leads
>>> to table fragmentation, but there's a limited number of entries in the
>>> e820 table.
>>>
>>> Another option is to mark such memory as usable in e820 and track if the
>>> range has been accepted in a bitmap. One bit in the bitmap represents
>>> 2MiB in the address space: one 4k page is enough to track 64GiB of
>>> physical address space.
>>>
>>> In the worst-case scenario -- a huge hole in the middle of the
>>> address space -- it needs 256MiB to handle 4PiB of the address
>>> space.
>>>
>>> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
>>>
>>> The approach lowers boot time substantially. Boot to shell is ~2.5x
>>> faster for 4G TDX VM and ~4x faster for 64G.
>>>
>>> TDX-specific code is isolated from the core of unaccepted memory support.
>>> This is supposed to help plug in different implementations of unaccepted
>>> memory, such as SEV-SNP.
>>>
>>> -- Fragmentation study --
>>>
>>> Vlastimil and Mel were concerned about the effect of unaccepted memory on
>>> fragmentation prevention measures in page allocator. I tried to evaluate
>>> it, but it is tricky. As suggested I tried to run multiple parallel kernel
>>> builds and follow how often kmem:mm_page_alloc_extfrag gets hit.
>>>
>>> See results in the v9 of the patchset[1][2]
>>>
>>> [1] https://lore.kernel.org/all/20230330114956.20342-1-kirill.shutemov@linux.intel.com
>>> [2] https://lore.kernel.org/all/20230416191940.ex7ao43pmrjhru2p@box.shutemov.name
>>>
>>> --
>>>
>>> The tree can be found here:
>>>
>>> https://github.com/intel/tdx.git guest-unaccepted-memory
>>
>> I get some failures when building without TDX support selected in my
>> kernel config after adding unaccepted memory support for SNP:
>>
>>    In file included from arch/x86/boot/compressed/../../coco/tdx/tdx-shared.c:1,
>>                     from arch/x86/boot/compressed/tdx-shared.c:2:
>>    ./arch/x86/include/asm/tdx.h: In function ‘tdx_kvm_hypercall’:
>>    ./arch/x86/include/asm/tdx.h:72:17: error: ‘ENODEV’ undeclared (first use in this function)
>>       72 |         return -ENODEV;
>>          |                 ^~~~~~
>>    ./arch/x86/include/asm/tdx.h:72:17: note: each undeclared identifier is reported only once for each function it appears in
>>
>> Adding an include for linux/errno.h gets past that error, but then
>> I get the following:
>>
>>    ld: arch/x86/boot/compressed/tdx-shared.o: in function `tdx_enc_status_changed_phys':
>>    tdx-shared.c:(.text+0x42): undefined reference to `__tdx_hypercall'
>>    ld: tdx-shared.c:(.text+0x7f): undefined reference to `__tdx_module_call'
>>    ld: tdx-shared.c:(.text+0xce): undefined reference to `__tdx_module_call'
>>    ld: tdx-shared.c:(.text+0x13b): undefined reference to `__tdx_module_call'
>>    ld: tdx-shared.c:(.text+0x153): undefined reference to `cc_mkdec'
>>    ld: tdx-shared.c:(.text+0x15d): undefined reference to `cc_mkdec'
>>    ld: tdx-shared.c:(.text+0x18e): undefined reference to `__tdx_hypercall'
>>    ld: arch/x86/boot/compressed/vmlinux: hidden symbol `__tdx_hypercall' isn't defined
>>    ld: final link failed: bad value
>>
>> So it looks like arch/x86/boot/compressed/tdx-shared.c is being
>> built, while arch/x86/boot/compressed/tdx.c isn't.
> 
> Right. I think this should help:
> 
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 78f67e0a2666..b13a58021086 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -106,8 +106,8 @@ ifdef CONFIG_X86_64
>   endif
> 
>   vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
> -vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> -vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o $(obj)/tdx-shared.o
> +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o $(obj)/tdx-shared.o
> +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
> 
>   vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>   vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> 
>> After setting TDX in the kernel config, I can build successfully, but
>> I'm running into an error when trying to accept memory during
>> decompression.
>>
>> In drivers/firmware/efi/libstub/unaccepted_memory.c, I can see that the
>> unaccepted_table is allocated, but when accept_memory() is invoked the
>> table address is now zero. I thought maybe it had to do with bss, but even
>> putting it in the .data section didn't help. I'll keep digging, but if you
>> have any ideas, that would be great.
> 
> Not right away. But maybe seeing your side of enabling would help.

Let me get something pushed up where you can access it and I'll also send
you my kernel config.

In the meantime I added the following and everything worked. But I'm not
sure how acceptable it is to always be checking for the table when the
value is zero.


diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index f4642c4f25dd..8c5632ab1208 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -183,8 +183,13 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
  	unsigned long bitmap_size;
  	u64 unit_size;
  
-	if (!unaccepted_table)
-		return;
+	if (!unaccepted_table) {
+		efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
+
+		unaccepted_table = get_efi_config_table(unaccepted_table_guid);
+		if (!unaccepted_table)
+			return;
+	}
  
  	unit_size = unaccepted_table->unit_size;
  

Thanks,
Tom

> 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 4/9] x86/boot/compressed: Handle unaccepted memory
  2023-05-13 22:04 ` [PATCHv11 4/9] x86/boot/compressed: Handle " Kirill A. Shutemov
  2023-05-16 17:09   ` Liam Merwick
@ 2023-05-17 15:52   ` Tom Lendacky
  1 sibling, 0 replies; 38+ messages in thread
From: Tom Lendacky @ 2023-05-17 15:52 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel

On 5/13/23 17:04, Kirill A. Shutemov wrote:
> The firmware will pre-accept the memory used to run the stub. But, the
> stub is responsible for accepting the memory into which it decompresses
> the main kernel. Accept memory just before decompression starts.
> 
> The stub is also responsible for choosing a physical address in which to
> place the decompressed kernel image. The KASLR mechanism will randomize
> this physical address. Since the unaccepted memory region is relatively
> small, KASLR would be quite ineffective if it only used the pre-accepted
> area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes among the
> entire physical address space by also including EFI_UNACCEPTED_MEMORY.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>   arch/x86/boot/compressed/efi.h   |  1 +
>   arch/x86/boot/compressed/kaslr.c | 35 +++++++++++++++++++++-----------
>   arch/x86/boot/compressed/misc.c  |  6 ++++++
>   arch/x86/boot/compressed/misc.h  |  6 ++++++
>   4 files changed, 36 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
> index 7db2f41b54cd..cf475243b6d5 100644
> --- a/arch/x86/boot/compressed/efi.h
> +++ b/arch/x86/boot/compressed/efi.h
> @@ -32,6 +32,7 @@ typedef	struct {
>   } efi_table_hdr_t;
>   
>   #define EFI_CONVENTIONAL_MEMORY		 7
> +#define EFI_UNACCEPTED_MEMORY		15
>   
>   #define EFI_MEMORY_MORE_RELIABLE \
>   				((u64)0x0000000000010000ULL)	/* higher reliability */
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 454757fbdfe5..749f0fe7e446 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -672,6 +672,28 @@ static bool process_mem_region(struct mem_vector *region,
>   }
>   
>   #ifdef CONFIG_EFI
> +
> +/*
> + * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
> + * guaranteed to be free.
> + *
> + * It is more conservative in picking free memory than the EFI spec allows:
> + *
> + * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
> + * and thus available to place the kernel image into, but in practice there's
> + * firmware where using that memory leads to crashes.
> + */
> +static inline bool memory_type_is_free(efi_memory_desc_t *md)
> +{
> +	if (md->type == EFI_CONVENTIONAL_MEMORY)
> +		return true;
> +
> +	if (md->type == EFI_UNACCEPTED_MEMORY)
> +		return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
> +
> +	return false;
> +}
> +
>   /*
>    * Returns true if we processed the EFI memmap, which we prefer over the E820
>    * table if it is available.
> @@ -716,18 +738,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
>   	for (i = 0; i < nr_desc; i++) {
>   		md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
>   
> -		/*
> -		 * Here we are more conservative in picking free memory than
> -		 * the EFI spec allows:
> -		 *
> -		 * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
> -		 * free memory and thus available to place the kernel image into,
> -		 * but in practice there's firmware where using that memory leads
> -		 * to crashes.
> -		 *
> -		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
> -		 */
> -		if (md->type != EFI_CONVENTIONAL_MEMORY)
> +		if (!memory_type_is_free(md))
>   			continue;
>   
>   		if (efi_soft_reserve_enabled() &&
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index 014ff222bf4b..eb8df0d4ad51 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -455,6 +455,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
>   #endif
>   
>   	debug_putstr("\nDecompressing Linux... ");
> +
> +	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
> +		debug_putstr("Accepting memory... ");
> +		accept_memory(__pa(output), __pa(output) + needed_size);
> +	}
> +
>   	__decompress(input_data, input_len, NULL, NULL, output, output_len,
>   			NULL, error);
>   	entry_offset = parse_elf(output);
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 2f155a0e3041..9663d1839f54 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -247,4 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
>   }
>   #endif /* CONFIG_EFI */
>   
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +#else
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
> +#endif
> +
>   #endif /* BOOT_COMPRESSED_MISC_H */
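
The type filter that the kaslr.c hunk above introduces can be modelled
outside the kernel. The following is a minimal userspace sketch, not the
patch's code: struct sketch_memdesc, SKETCH_UNACCEPTED_MEMORY and the
sketch_* helpers are hypothetical stand-ins for efi_memory_desc_t,
IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) and memory_type_is_free(). Only the
two EFI type values are taken from the quoted efi.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Type values as defined in the quoted arch/x86/boot/compressed/efi.h hunk. */
#define EFI_CONVENTIONAL_MEMORY   7
#define EFI_UNACCEPTED_MEMORY    15

/* Hypothetical stand-in for IS_ENABLED(CONFIG_UNACCEPTED_MEMORY). */
#ifndef SKETCH_UNACCEPTED_MEMORY
#define SKETCH_UNACCEPTED_MEMORY 1
#endif

/* Hypothetical, trimmed descriptor; the real efi_memory_desc_t has more fields. */
struct sketch_memdesc {
	uint32_t type;
};

/* Mirrors the decision made by memory_type_is_free() in the patch. */
static bool sketch_type_is_free(const struct sketch_memdesc *md)
{
	assert(md);

	if (md->type == EFI_CONVENTIONAL_MEMORY)
		return true;

	/* Unaccepted memory only counts as free when support is built in. */
	if (md->type == EFI_UNACCEPTED_MEMORY)
		return SKETCH_UNACCEPTED_MEMORY;

	return false;
}

/* Count the descriptors that a KASLR-style scan would consider. */
static unsigned int sketch_count_free(const struct sketch_memdesc *map,
				      unsigned int nr)
{
	unsigned int i, n = 0;

	for (i = 0; i < nr; i++)
		if (sketch_type_is_free(&map[i]))
			n++;

	return n;
}
```

With support enabled, a map holding one conventional, one unaccepted and
two boot-services descriptors yields two candidates. Building the sketch
with -DSKETCH_UNACCEPTED_MEMORY=0 drops the unaccepted entry, which is the
case the commit message describes as making KASLR ineffective.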

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11.1 5/9] efi: Add unaccepted memory support
  2023-05-16 12:06   ` [PATCHv11.1 5/9] efi: Add unaccepted memory support Kirill A. Shutemov
  2023-05-16 17:25     ` Ard Biesheuvel
@ 2023-05-17 15:58     ` Tom Lendacky
  1 sibling, 0 replies; 38+ messages in thread
From: Tom Lendacky @ 2023-05-17 15:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: aarcange, ak, akpm, ardb, bp, dave.hansen, david, dfaggioli,
	jroedel, khalid.elmously, linux-coco, linux-efi, linux-kernel,
	linux-mm, luto, marcelo.cerri, mgorman, mingo, pbonzini, peterx,
	peterz, philip.cox, rientjes, rppt, sathyanarayanan.kuppuswamy,
	seanjc, tglx, tim.gardner, vbabka, x86

On 5/16/23 07:06, Kirill A. Shutemov wrote:
> efi_config_parse_tables() reserves the memory that holds the unaccepted
> memory configuration table so it won't be reused by the page allocator.
> 
> Core-mm requires a few helpers to support unaccepted memory:
> 
>   - accept_memory() checks the range of addresses against the bitmap and
>     accepts memory if needed.
> 
>   - range_contains_unaccepted_memory() checks if anything within the
>     range requires acceptance.
> 
> Architectural code has to provide efi_get_unaccepted_table() that
> returns a pointer to the unaccepted memory configuration table.
> 
> arch_accept_memory() handles the arch-specific part of memory acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Can you also add the efi.unaccepted table to the efi_tables array in 
arch/x86/platform/efi/efi.c?

With that...

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

> 
> v11.1:
>   - Add missing memblock_reserve() for the unaccepted memory
>     configuration table.
> 
> ---
>   drivers/firmware/efi/Makefile            |   1 +
>   drivers/firmware/efi/efi.c               |  25 ++++++
>   drivers/firmware/efi/unaccepted_memory.c | 103 +++++++++++++++++++++++
>   include/linux/efi.h                      |   1 +
>   4 files changed, 130 insertions(+)
>   create mode 100644 drivers/firmware/efi/unaccepted_memory.c
> 
> diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
> index b51f2a4c821e..e489fefd23da 100644
> --- a/drivers/firmware/efi/Makefile
> +++ b/drivers/firmware/efi/Makefile
> @@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)	+= capsule-loader.o
>   obj-$(CONFIG_EFI_EARLYCON)		+= earlycon.o
>   obj-$(CONFIG_UEFI_CPER_ARM)		+= cper-arm.o
>   obj-$(CONFIG_UEFI_CPER_X86)		+= cper-x86.o
> +obj-$(CONFIG_UNACCEPTED_MEMORY)		+= unaccepted_memory.o
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 7dce06e419c5..d817e7afd266 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
>   #ifdef CONFIG_EFI_COCO_SECRET
>   	.coco_secret		= EFI_INVALID_TABLE_ADDR,
>   #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	.unaccepted		= EFI_INVALID_TABLE_ADDR,
> +#endif
>   };
>   EXPORT_SYMBOL(efi);
>   
> @@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
>   #ifdef CONFIG_EFI_COCO_SECRET
>   	{LINUX_EFI_COCO_SECRET_AREA_GUID,	&efi.coco_secret,	"CocoSecret"	},
>   #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	{LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID,	&efi.unaccepted,	"Unaccepted"	},
> +#endif
>   #ifdef CONFIG_EFI_GENERIC_STUB
>   	{LINUX_EFI_SCREEN_INFO_TABLE_GUID,	&screen_info_table			},
>   #endif
> @@ -759,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
>   		}
>   	}
>   
> +	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
> +	    efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> +		struct efi_unaccepted_memory *unaccepted;
> +
> +		unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
> +		if (unaccepted) {
> +			unsigned long size;
> +
> +			if (unaccepted->version == 1) {
> +				size = sizeof(*unaccepted) + unaccepted->size;
> +				memblock_reserve(efi.unaccepted, size);
> +			} else {
> +				efi.unaccepted = EFI_INVALID_TABLE_ADDR;
> +			}
> +
> +			early_memunmap(unaccepted, sizeof(*unaccepted));
> +		}
> +	}
> +
>   	return 0;
>   }
>   
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> new file mode 100644
> index 000000000000..bb91c41f76fb
> --- /dev/null
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <linux/memblock.h>
> +#include <linux/spinlock.h>
> +#include <asm/unaccepted_memory.h>
> +
> +/* Protects unaccepted memory bitmap */
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	struct efi_unaccepted_memory *unaccepted;
> +	unsigned long range_start, range_end;
> +	unsigned long flags;
> +	u64 unit_size;
> +
> +	if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
> +		return;
> +
> +	unaccepted = efi_get_unaccepted_table();
> +	if (!unaccepted)
> +		return;
> +
> +	unit_size = unaccepted->unit_size;
> +
> +	/*
> +	 * Only care for the part of the range that is represented
> +	 * in the bitmap.
> +	 */
> +	if (start < unaccepted->phys_base)
> +		start = unaccepted->phys_base;
> +	if (end < unaccepted->phys_base)
> +		return;
> +
> +	/* Translate to offsets from the beginning of the bitmap */
> +	start -= unaccepted->phys_base;
> +	end -= unaccepted->phys_base;
> +
> +	/* Make sure not to overrun the bitmap */
> +	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> +		end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> +	range_start = start / unit_size;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
> +				   DIV_ROUND_UP(end, unit_size)) {
> +		unsigned long phys_start, phys_end;
> +		unsigned long len = range_end - range_start;
> +
> +		phys_start = range_start * unit_size + unaccepted->phys_base;
> +		phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> +		arch_accept_memory(phys_start, phys_end);
> +		bitmap_clear(unaccepted->bitmap, range_start, len);
> +	}
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}
> +
> +bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	struct efi_unaccepted_memory *unaccepted;
> +	unsigned long flags;
> +	bool ret = false;
> +	u64 unit_size;
> +
> +	unaccepted = efi_get_unaccepted_table();
> +	if (!unaccepted)
> +		return false;
> +
> +	unit_size = unaccepted->unit_size;
> +
> +	/*
> +	 * Only care for the part of the range that is represented
> +	 * in the bitmap.
> +	 */
> +	if (start < unaccepted->phys_base)
> +		start = unaccepted->phys_base;
> +	if (end < unaccepted->phys_base)
> +		return false;
> +
> +	/* Translate to offsets from the beginning of the bitmap */
> +	start -= unaccepted->phys_base;
> +	end -= unaccepted->phys_base;
> +
> +	/* Make sure not to overrun the bitmap */
> +	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> +		end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	while (start < end) {
> +		if (test_bit(start / unit_size, unaccepted->bitmap)) {
> +			ret = true;
> +			break;
> +		}
> +
> +		start += unit_size;
> +	}
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> +	return ret;
> +}
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 29cc622910da..9864f9c00da2 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -646,6 +646,7 @@ extern struct efi {
>   	unsigned long			tpm_final_log;		/* TPM2 Final Events Log table */
>   	unsigned long			mokvar_table;		/* MOK variable config table */
>   	unsigned long			coco_secret;		/* Confidential computing secret table */
> +	unsigned long			unaccepted;		/* Unaccepted memory table */
>   
>   	efi_get_time_t			*get_time;
>   	efi_set_time_t			*set_time;
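
The range clamping and bitmap walk in accept_memory() above can be tried
in isolation. The following is a minimal userspace sketch under stated
assumptions: struct sketch_table and the sketch_* helpers are hypothetical,
the bitmap is a plain byte array rather than the kernel's unsigned long
bitmap, there is no locking, and the call to arch_accept_memory() is
replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the table; field names follow the quoted patch. */
struct sketch_table {
	uint64_t phys_base;	/* physical address covered by bit 0 */
	uint64_t unit_size;	/* bytes of memory represented by one bit */
	uint32_t size;		/* bitmap size in bytes */
	uint8_t *bitmap;	/* set bit == unit still unaccepted */
};

static bool sketch_test_bit(const struct sketch_table *t, uint64_t bit)
{
	return t->bitmap[bit / 8] & (1u << (bit % 8));
}

static void sketch_clear_bit(struct sketch_table *t, uint64_t bit)
{
	t->bitmap[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}

/* Accept [start, end) and return how many units had to be accepted. */
static uint64_t sketch_accept(struct sketch_table *t, uint64_t start,
			      uint64_t end)
{
	uint64_t limit = (uint64_t)t->size * 8 * t->unit_size;
	uint64_t bit, last, accepted = 0;

	assert(t->unit_size != 0);

	/* Only the part of the range represented in the bitmap matters. */
	if (end < t->phys_base)
		return 0;
	if (start < t->phys_base)
		start = t->phys_base;

	/* Translate to offsets from the beginning of the bitmap. */
	start -= t->phys_base;
	end -= t->phys_base;

	/* Make sure not to overrun the bitmap. */
	if (end > limit)
		end = limit;

	last = (end + t->unit_size - 1) / t->unit_size;
	for (bit = start / t->unit_size; bit < last; bit++) {
		if (!sketch_test_bit(t, bit))
			continue;
		/* The kernel would call arch_accept_memory() here. */
		sketch_clear_bit(t, bit);
		accepted++;
	}

	return accepted;
}
```

A second call over the same range returns 0, because the bits were already
cleared. That is the property that lets callers invoke accept_memory()
without tracking what was accepted earlier.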

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  2023-05-13 22:04 ` [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory Kirill A. Shutemov
  2023-05-16 18:08   ` Ard Biesheuvel
@ 2023-05-17 16:07   ` Tom Lendacky
  1 sibling, 0 replies; 38+ messages in thread
From: Tom Lendacky @ 2023-05-17 16:07 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Dario Faggioli, Mike Rapoport, David Hildenbrand,
	Mel Gorman, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, aarcange, peterx, x86, linux-mm, linux-coco,
	linux-efi, linux-kernel, Dave Hansen

On 5/13/23 17:04, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
> 
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.
> 
> There are two parts to fix this issue and comprehensively avoid access
> to unaccepted memory. Together these ensure that an extra "guard" page
> is accepted in addition to the memory that needs to be used.
> 
> 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
>     checks up to end+unit_size if 'end' is aligned on a unit_size
>     boundary.
> 2. Implicitly extend accept_memory(start, end) to end+unit_size if 'end'
>     is aligned on a unit_size boundary.
> 
> Side note: This leads to something strange. Pages which were accepted
> 	   at boot, marked by the firmware as accepted and will never
> 	   _need_ to be accepted might be on unaccepted_pages list
> 	   This is a cue to ensure that the next page is accepted
> 	   before 'page' can be used.
> 
> This is an actual, real-world problem which was discovered during TDX
> testing.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>   drivers/firmware/efi/unaccepted_memory.c | 35 ++++++++++++++++++++++++
>   1 file changed, 35 insertions(+)
> 
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> index bb91c41f76fb..3d1ca60916dd 100644
> --- a/drivers/firmware/efi/unaccepted_memory.c
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -37,6 +37,34 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>   	start -= unaccepted->phys_base;
>   	end -= unaccepted->phys_base;
>   
> +	/*
> +	 * load_unaligned_zeropad() can lead to unwanted loads across page
> +	 * boundaries. The unwanted loads are typically harmless. But, they
> +	 * might be made to totally unrelated or even unmapped memory.
> +	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> +	 * #VE) to recover from these unwanted loads.
> +	 *
> +	 * But, this approach does not work for unaccepted memory. For TDX, a
> +	 * load from unaccepted memory will not lead to a recoverable exception
> +	 * within the guest. The guest will exit to the VMM where the only
> +	 * recourse is to terminate the guest.
> +	 *
> +	 * There are two parts to fix this issue and comprehensively avoid
> +	 * access to unaccepted memory. Together these ensure that an extra
> +	 * "guard" page is accepted in addition to the memory that needs to be
> +	 * used:
> +	 *
> +	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> +	 *    checks up to end+unit_size if 'end' is aligned on a unit_size
> +	 *    boundary.
> +	 *
> +	 * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
> +	 *    'end' is aligned on a unit_size boundary. (immediately following
> +	 *    this comment)
> +	 */
> +	if (!(end % unit_size))
> +		end += unit_size;
> +
>   	/* Make sure not to overrun the bitmap */
>   	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
>   		end = unaccepted->size * unit_size * BITS_PER_BYTE;
> @@ -84,6 +112,13 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
>   	start -= unaccepted->phys_base;
>   	end -= unaccepted->phys_base;
>   
> +	/*
> +	 * Also consider the unaccepted state of the *next* page. See fix #1 in
> +	 * the comment on load_unaligned_zeropad() in accept_memory().
> +	 */
> +	if (!(end % unit_size))
> +		end += unit_size;
> +
>   	/* Make sure not to overrun the bitmap */
>   	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
>   		end = unaccepted->size * unit_size * BITS_PER_BYTE;
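
The guard-unit extension added by both hunks above comes down to a few
lines of arithmetic. The following is a minimal userspace sketch, assuming
'end' has already been translated to an offset from phys_base as in the
patch. sketch_extend_end() and its bitmap_bytes parameter are hypothetical
names used for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the end offset that would actually be checked or accepted:
 * one extra unit when 'end' sits exactly on a unit boundary, clamped
 * to the span the bitmap can describe.
 */
static uint64_t sketch_extend_end(uint64_t end, uint64_t unit_size,
				  uint64_t bitmap_bytes)
{
	uint64_t limit = bitmap_bytes * 8 * unit_size;

	assert(unit_size != 0);

	/* An aligned end means a stray load could touch the next unit. */
	if (!(end % unit_size))
		end += unit_size;

	/* Make sure not to overrun the bitmap. */
	if (end > limit)
		end = limit;

	return end;
}
```

With 2MiB units and a one-byte bitmap, an end offset of 0x400000 becomes
0x600000, an unaligned end such as 0x400001 is left alone, and an end at
the very top of the bitmap stays clamped to the bitmap limit.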

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory
  2023-05-17 14:32     ` Tom Lendacky
@ 2023-05-17 18:36       ` Kirill A. Shutemov
  2023-05-17 18:50         ` Tom Lendacky
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2023-05-17 18:36 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On Wed, May 17, 2023 at 09:32:27AM -0500, Tom Lendacky wrote:
> On 5/16/23 18:22, Kirill A. Shutemov wrote:
> > On Tue, May 16, 2023 at 05:41:55PM -0500, Tom Lendacky wrote:
> > > On 5/13/23 17:04, Kirill A. Shutemov wrote:
> > > > UEFI Specification version 2.9 introduces the concept of memory
> > > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > > SEV-SNP, require memory to be accepted before it can be used by the
> > > > guest. Accepting happens via a protocol specific to the Virtual
> > > > Machine platform.
> > > > 
> > > > Accepting memory is costly and it makes VMM allocate memory for the
> > > > accepted guest physical address range. It's better to postpone memory
> > > > acceptance until memory is needed. It lowers boot time and reduces
> > > > memory overhead.
> > > > 
> > > > The kernel needs to know what memory has been accepted. Firmware
> > > > communicates this information via memory map: a new memory type --
> > > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > > 
> > > > Range-based tracking works fine for firmware, but it gets bulky for
> > > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > > to table fragmentation, but there's a limited number of entries in the
> > > > e820 table.
> > > > 
> > > > Another option is to mark such memory as usable in e820 and track if the
> > > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > > > physical address space.
> > > > 
> > > > In the worst-case scenario -- a huge hole in the middle of the
> > > > address space -- it needs 256MiB to handle 4PiB of the address
> > > > space.
> > > > 
> > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > > 
> > > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > > faster for 4G TDX VM and ~4x faster for 64G.
> > > > 
> > > > TDX-specific code is isolated from the core of unaccepted memory support.
> > > > This is supposed to help plug in different implementations of unaccepted
> > > > memory, such as SEV-SNP.
> > > > 
> > > > -- Fragmentation study --
> > > > 
> > > > Vlastimil and Mel were concerned about the effect of unaccepted memory on
> > > > fragmentation prevention measures in page allocator. I tried to evaluate
> > > > it, but it is tricky. As suggested I tried to run multiple parallel kernel
> > > > builds and follow how often kmem:mm_page_alloc_extfrag gets hit.
> > > > 
> > > > See results in the v9 of the patchset[1][2]
> > > > 
> > > > [1] https://lore.kernel.org/all/20230330114956.20342-1-kirill.shutemov@linux.intel.com
> > > > [2] https://lore.kernel.org/all/20230416191940.ex7ao43pmrjhru2p@box.shutemov.name
> > > > 
> > > > --
> > > > 
> > > > The tree can be found here:
> > > > 
> > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > > 
> > > I get some failures when building without TDX support selected in my
> > > kernel config after adding unaccepted memory support for SNP:
> > > 
> > >    In file included from arch/x86/boot/compressed/../../coco/tdx/tdx-shared.c:1,
> > >                     from arch/x86/boot/compressed/tdx-shared.c:2:
> > >    ./arch/x86/include/asm/tdx.h: In function ‘tdx_kvm_hypercall’:
> > >    ./arch/x86/include/asm/tdx.h:72:17: error: ‘ENODEV’ undeclared (first use in this function)
> > >       72 |         return -ENODEV;
> > >          |                 ^~~~~~
> > >    ./arch/x86/include/asm/tdx.h:72:17: note: each undeclared identifier is reported only once for each function it appears in
> > > 
> > > Adding an include for linux/errno.h gets past that error, but then
> > > I get the following:
> > > 
> > >    ld: arch/x86/boot/compressed/tdx-shared.o: in function `tdx_enc_status_changed_phys':
> > >    tdx-shared.c:(.text+0x42): undefined reference to `__tdx_hypercall'
> > >    ld: tdx-shared.c:(.text+0x7f): undefined reference to `__tdx_module_call'
> > >    ld: tdx-shared.c:(.text+0xce): undefined reference to `__tdx_module_call'
> > >    ld: tdx-shared.c:(.text+0x13b): undefined reference to `__tdx_module_call'
> > >    ld: tdx-shared.c:(.text+0x153): undefined reference to `cc_mkdec'
> > >    ld: tdx-shared.c:(.text+0x15d): undefined reference to `cc_mkdec'
> > >    ld: tdx-shared.c:(.text+0x18e): undefined reference to `__tdx_hypercall'
> > >    ld: arch/x86/boot/compressed/vmlinux: hidden symbol `__tdx_hypercall' isn't defined
> > >    ld: final link failed: bad value
> > > 
> > > So it looks like arch/x86/boot/compressed/tdx-shared.c is being
> > > built, while arch/x86/boot/compressed/tdx.c isn't.
> > 
> > Right. I think this should help:
> > 
> > diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> > index 78f67e0a2666..b13a58021086 100644
> > --- a/arch/x86/boot/compressed/Makefile
> > +++ b/arch/x86/boot/compressed/Makefile
> > @@ -106,8 +106,8 @@ ifdef CONFIG_X86_64
> >   endif
> > 
> >   vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
> > -vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> > -vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o $(obj)/tdx-shared.o
> > +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o $(obj)/tdx-shared.o
> > +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
> > 
> >   vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
> >   vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> > 
> > > After setting TDX in the kernel config, I can build successfully, but
> > > I'm running into an error when trying to accept memory during
> > > decompression.
> > > 
> > > In drivers/firmware/efi/libstub/unaccepted_memory.c, I can see that the
> > > unaccepted_table is allocated, but when accept_memory() is invoked the
> > > table address is now zero. I thought maybe it had to do with bss, but even
> > > putting it in the .data section didn't help. I'll keep digging, but if you
> > > have any ideas, that would be great.
> > 
> > Not right away. But maybe seeing your side of enabling would help.
> 
> Let me get something pushed up where you can access it and I'll also send
> you my kernel config.
> 
> In the meantime I added the following and everything worked. But I'm not
> sure how acceptable it is to always be checking for the table when the
> value is zero.
> 
> 
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> index f4642c4f25dd..8c5632ab1208 100644
> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -183,8 +183,13 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  	unsigned long bitmap_size;
>  	u64 unit_size;
> -	if (!unaccepted_table)
> -		return;
> +	if (!unaccepted_table) {
> +		efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
> +
> +		unaccepted_table = get_efi_config_table(unaccepted_table_guid);
> +		if (!unaccepted_table)
> +			return;
> +	}
>  	unit_size = unaccepted_table->unit_size;
> 

Kudos to Ard: if efi_relocate_kernel() is triggered, it copies the kernel
image to the new location before the variable gets initialized, so the
decompressor has to initialize it explicitly.
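
The ordering problem can be illustrated with a minimal userspace sketch.
This is not kernel code: struct image, relocate() and
table_seen_after_relocation() are hypothetical stand-ins. The struct plays
the role of the kernel image carrying the static unaccepted_table pointer,
and the memcpy() plays the role of efi_relocate_kernel().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel image and its static variable. */
struct image {
	void *unaccepted_table;
	char payload[32];
};

/* Stand-in for efi_relocate_kernel(): a plain copy of the image. */
static void relocate(struct image *dst, const struct image *src)
{
	memcpy(dst, src, sizeof(*dst));
}

/*
 * Returns the table pointer as seen by the relocated copy.
 * set_before_copy == 0 models the ordering described above:
 * the image is copied first, the variable is set afterwards.
 */
static void *table_seen_after_relocation(int set_before_copy)
{
	static char table[16];
	struct image orig = { 0 };
	struct image copy;

	if (set_before_copy)
		orig.unaccepted_table = table;

	relocate(&copy, &orig);

	/* The stub sets the variable, but only in the old image. */
	orig.unaccepted_table = table;
	assert(orig.unaccepted_table != NULL);

	return copy.unaccepted_table;
}
```

With set_before_copy == 0 the relocated copy still sees a NULL pointer,
which matches the zero table address reported above. The copy only sees
the table if the variable was set before the image was copied.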

It also covers the cases when the bootloader doesn't use the EFI stub,
including kexec.

I think this fixup should work.

diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
index cf475243b6d5..866c0af8b5b9 100644
--- a/arch/x86/boot/compressed/efi.h
+++ b/arch/x86/boot/compressed/efi.h
@@ -16,6 +16,7 @@ typedef guid_t efi_guid_t __aligned(__alignof__(u32));
 #define ACPI_TABLE_GUID				EFI_GUID(0xeb9d2d30, 0x2d88, 0x11d3,  0x9a, 0x16, 0x00, 0x90, 0x27, 0x3f, 0xc1, 0x4d)
 #define ACPI_20_TABLE_GUID			EFI_GUID(0x8868e871, 0xe4f1, 0x11d3,  0xbc, 0x22, 0x00, 0x80, 0xc7, 0x3c, 0x88, 0x81)
 #define EFI_CC_BLOB_GUID			EFI_GUID(0x067b1f5f, 0xcf26, 0x44c5, 0x85, 0x54, 0x93, 0xd7, 0x77, 0x91, 0x2d, 0x42)
+#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID	EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9,  0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
 
 #define EFI32_LOADER_SIGNATURE	"EL32"
 #define EFI64_LOADER_SIGNATURE	"EL64"
@@ -105,6 +106,14 @@ struct efi_setup_data {
 	u64 reserved[8];
 };
 
+struct efi_unaccepted_memory {
+	u32 version;
+	u32 unit_size;
+	u64 phys_base;
+	u64 size;
+	unsigned long bitmap[];
+};
+
 static inline int efi_guidcmp (efi_guid_t left, efi_guid_t right)
 {
 	return memcmp(&left, &right, sizeof (efi_guid_t));
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index a4308d077885..0108c97399a5 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -1,8 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
-#include "../cpuflags.h"
-#include "../string.h"
 #include "error.h"
+#include "misc.h"
 #include "tdx.h"
 #include <asm/shared/tdx.h>
 
@@ -40,3 +39,25 @@ void arch_accept_memory(phys_addr_t start, phys_addr_t end)
 	else
 		error("Cannot accept memory: unknown platform\n");
 }
+
+void init_unaccepted_memory(void)
+{
+	guid_t guid =  LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
+	struct efi_unaccepted_memory *unaccepted_table;
+	unsigned long cfg_table_pa;
+	unsigned int cfg_table_len;
+	int ret;
+
+	ret = efi_get_conf_table(boot_params, &cfg_table_pa, &cfg_table_len);
+	if (ret)
+		error("EFI config table not found.");
+
+	unaccepted_table = (void *)efi_find_vendor_table(boot_params,
+							 cfg_table_pa,
+							 cfg_table_len,
+							 guid);
+	if (unaccepted_table->version != 1)
+		error("Unknown version of unaccepted memory table\n");
+
+	set_unaccepted_table(unaccepted_table);
+}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index eb8df0d4ad51..36535a3753f5 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -458,6 +458,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 
 	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
 		debug_putstr("Accepting memory... ");
+		init_unaccepted_memory();
 		accept_memory(__pa(output), __pa(output) + needed_size);
 	}
 
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 9663d1839f54..e1a0b49e0ed2 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -247,10 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
 }
 #endif /* CONFIG_EFI */
 
-#ifdef CONFIG_UNACCEPTED_MEMORY
+void init_unaccepted_memory(void);
+
+/* Implemented in EFI stub */
+void set_unaccepted_table(struct efi_unaccepted_memory *table);
 void accept_memory(phys_addr_t start, phys_addr_t end);
-#else
-static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
-#endif
 
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index f4642c4f25dd..fd6a3195c68f 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -6,6 +6,18 @@
 
 static struct efi_unaccepted_memory *unaccepted_table;
 
+/*
+ * Decompressor needs to initialize the variable to cover cases when the table
+ * is not allocated by EFI stub or EFI stub copied the kernel image with
+ * efi_relocate_kernel() before the variable is set.
+ *
+ * It must be called before the first usage of accept_memory() by decompressor.
+ */
+void set_unaccepted_table(struct efi_unaccepted_memory *table)
+{
+	unaccepted_table = table;
+}
+
 efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 					struct efi_boot_memmap *map)
 {
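
Note that init_unaccepted_memory() in the fixup above dereferences the
result of efi_find_vendor_table() without checking it. Assuming that
helper returns 0 when the table is absent, a NULL check before the version
test may be worth adding. A minimal userspace sketch of that check follows.
It is not a tested kernel change: struct table, find_table(), init_table()
and the enum are hypothetical stand-ins for struct efi_unaccepted_memory,
efi_find_vendor_table() and init_unaccepted_memory().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct efi_unaccepted_memory. */
struct table {
	uint32_t version;
	uint32_t unit_size;
	uint64_t phys_base;
	uint64_t size;
};

enum init_result { INIT_OK, INIT_NO_TABLE, INIT_BAD_VERSION };

/* Plays the role of the static unaccepted_table variable. */
static struct table *current_table;

/* Stand-in for efi_find_vendor_table(): 0 means "not found". */
static uintptr_t find_table(struct table *present)
{
	return (uintptr_t)present;
}

static enum init_result init_table(struct table *present)
{
	struct table *t = (struct table *)find_table(present);

	/* Assumption: no table means there is nothing to track. */
	if (!t)
		return INIT_NO_TABLE;

	if (t->version != 1)
		return INIT_BAD_VERSION;

	/* Plays the role of set_unaccepted_table(). */
	current_table = t;
	return INIT_OK;
}
```

Whether a missing table should be a silent early return or a call to
error() is a design choice; the early return here is an assumption, on
the grounds that a platform without unaccepted memory has nothing to accept.
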
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory
  2023-05-17 18:36       ` Kirill A. Shutemov
@ 2023-05-17 18:50         ` Tom Lendacky
  0 siblings, 0 replies; 38+ messages in thread
From: Tom Lendacky @ 2023-05-17 18:50 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Sean Christopherson, Andrew Morton, Joerg Roedel,
	Ard Biesheuvel, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli, Mike Rapoport,
	David Hildenbrand, Mel Gorman, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, aarcange, peterx, x86, linux-mm,
	linux-coco, linux-efi, linux-kernel

On 5/17/23 13:36, Kirill A. Shutemov wrote:
> On Wed, May 17, 2023 at 09:32:27AM -0500, Tom Lendacky wrote:
>> On 5/16/23 18:22, Kirill A. Shutemov wrote:
>>> On Tue, May 16, 2023 at 05:41:55PM -0500, Tom Lendacky wrote:
>>>> On 5/13/23 17:04, Kirill A. Shutemov wrote:
>>>>> UEFI Specification version 2.9 introduces the concept of memory
>>>>> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
>>>>> SEV-SNP, require memory to be accepted before it can be used by the
>>>>> guest. Accepting happens via a protocol specific to the Virtual
>>>>> Machine platform.
>>>>>
>>>>> Accepting memory is costly and it makes VMM allocate memory for the
>>>>> accepted guest physical address range. It's better to postpone memory
>>>>> acceptance until memory is needed. It lowers boot time and reduces
>>>>> memory overhead.
>>>>>
>>>>> The kernel needs to know what memory has been accepted. Firmware
>>>>> communicates this information via memory map: a new memory type --
>>>>> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>>>>>
>>>>> Range-based tracking works fine for firmware, but it gets bulky for
>>>>> the kernel: e820 has to be modified on every page acceptance. This leads
>>>>> to table fragmentation, and there is only a limited number of entries in
>>>>> the e820 table.
>>>>>
>>>>> Another option is to mark such memory as usable in e820 and track if the
>>>>> range has been accepted in a bitmap. One bit in the bitmap represents
>>>>> 2MiB in the address space: one 4k page is enough to track 64GiB of
>>>>> physical address space.
>>>>>
>>>>> In the worst-case scenario -- a huge hole in the middle of the
>>>>> address space -- it needs 256MiB to handle 4PiB of the address
>>>>> space.
>>>>>
>>>>> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
>>>>>
>>>>> The approach lowers boot time substantially. Boot to shell is ~2.5x
>>>>> faster for 4G TDX VM and ~4x faster for 64G.
>>>>>
>>>>> TDX-specific code is isolated from the core of unaccepted memory support.
>>>>> It is supposed to help plug in different implementations of unaccepted
>>>>> memory, such as SEV-SNP.
>>>>>
>>>>> -- Fragmentation study --
>>>>>
>>>>> Vlastimil and Mel were concerned about the effect of unaccepted memory on
>>>>> fragmentation prevention measures in the page allocator. I tried to evaluate
>>>>> it, but it is tricky. As suggested, I tried to run multiple parallel kernel
>>>>> builds and follow how often kmem:mm_page_alloc_extfrag gets hit.
>>>>>
>>>>> See results in the v9 of the patchset[1][2]
>>>>>
>>>>> [1] https://lore.kernel.org/all/20230330114956.20342-1-kirill.shutemov@linux.intel.com
>>>>> [2] https://lore.kernel.org/all/20230416191940.ex7ao43pmrjhru2p@box.shutemov.name
>>>>>
>>>>> --
>>>>>
>>>>> The tree can be found here:
>>>>>
>>>>> https://github.com/intel/tdx.git guest-unaccepted-memory
>>>>
>>>> I get some failures when building without TDX support selected in my
>>>> kernel config after adding unaccepted memory support for SNP:
>>>>
>>>>     In file included from arch/x86/boot/compressed/../../coco/tdx/tdx-shared.c:1,
>>>>                      from arch/x86/boot/compressed/tdx-shared.c:2:
>>>>     ./arch/x86/include/asm/tdx.h: In function ‘tdx_kvm_hypercall’:
>>>>     ./arch/x86/include/asm/tdx.h:72:17: error: ‘ENODEV’ undeclared (first use in this function)
>>>>        72 |         return -ENODEV;
>>>>           |                 ^~~~~~
>>>>     ./arch/x86/include/asm/tdx.h:72:17: note: each undeclared identifier is reported only once for each function it appears in
>>>>
>>>> Adding an include for linux/errno.h gets past that error, but then
>>>> I get the following:
>>>>
>>>>     ld: arch/x86/boot/compressed/tdx-shared.o: in function `tdx_enc_status_changed_phys':
>>>>     tdx-shared.c:(.text+0x42): undefined reference to `__tdx_hypercall'
>>>>     ld: tdx-shared.c:(.text+0x7f): undefined reference to `__tdx_module_call'
>>>>     ld: tdx-shared.c:(.text+0xce): undefined reference to `__tdx_module_call'
>>>>     ld: tdx-shared.c:(.text+0x13b): undefined reference to `__tdx_module_call'
>>>>     ld: tdx-shared.c:(.text+0x153): undefined reference to `cc_mkdec'
>>>>     ld: tdx-shared.c:(.text+0x15d): undefined reference to `cc_mkdec'
>>>>     ld: tdx-shared.c:(.text+0x18e): undefined reference to `__tdx_hypercall'
>>>>     ld: arch/x86/boot/compressed/vmlinux: hidden symbol `__tdx_hypercall' isn't defined
>>>>     ld: final link failed: bad value
>>>>
>>>> So it looks like arch/x86/boot/compressed/tdx-shared.c is being
>>>> built, while arch/x86/boot/compressed/tdx.c isn't.
>>>
>>> Right. I think this should help:
>>>
>>> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
>>> index 78f67e0a2666..b13a58021086 100644
>>> --- a/arch/x86/boot/compressed/Makefile
>>> +++ b/arch/x86/boot/compressed/Makefile
>>> @@ -106,8 +106,8 @@ ifdef CONFIG_X86_64
>>>    endif
>>>
>>>    vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>>> -vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
>>> -vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o $(obj)/tdx-shared.o
>>> +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o $(obj)/tdx-shared.o
>>> +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
>>>
>>>    vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
>>>    vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
>>>
>>>> After setting TDX in the kernel config, I can build successfully, but
>>>> I'm running into an error when trying to accept memory during
>>>> decompression.
>>>>
>>>> In drivers/firmware/efi/libstub/unaccepted_memory.c, I can see that the
>>>> unaccepted_table is allocated, but when accept_memory() is invoked the
>>>> table address is now zero. I thought maybe it had to do with bss, but even
>>>> putting it in the .data section didn't help. I'll keep digging, but if you
>>>> have any ideas, that would be great.
>>>
>>> Not right away. But maybe seeing your side of enabling would help.
>>
>> Let me get something pushed up where you can access it and I'll also send
>> you my kernel config.
>>
>> In the meantime I added the following and everything worked. But I'm not
>> sure how acceptable it is to always be checking for the table when the
>> value is zero.
>>
>>
>> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
>> index f4642c4f25dd..8c5632ab1208 100644
>> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
>> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
>> @@ -183,8 +183,13 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>>   	unsigned long bitmap_size;
>>   	u64 unit_size;
>> -	if (!unaccepted_table)
>> -		return;
>> +	if (!unaccepted_table) {
>> +		efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
>> +
>> +		unaccepted_table = get_efi_config_table(unaccepted_table_guid);
>> +		if (!unaccepted_table)
>> +			return;
>> +	}
>>   	unit_size = unaccepted_table->unit_size;
>>
> 
> Kudos to Ard: if efi_relocate_kernel() is triggered, it copies the kernel
> image to the new location before the variable gets initialized, so the
> decompressor has to initialize it explicitly.
> 
> It also covers the cases when the bootloader doesn't use the EFI stub,
> including kexec.
> 
> I think this fixup should work.

Yes, this fixup takes care of the problem I was seeing.

Thanks!
Tom

> 
> diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
> index cf475243b6d5..866c0af8b5b9 100644
> --- a/arch/x86/boot/compressed/efi.h
> +++ b/arch/x86/boot/compressed/efi.h
> @@ -16,6 +16,7 @@ typedef guid_t efi_guid_t __aligned(__alignof__(u32));
>   #define ACPI_TABLE_GUID				EFI_GUID(0xeb9d2d30, 0x2d88, 0x11d3,  0x9a, 0x16, 0x00, 0x90, 0x27, 0x3f, 0xc1, 0x4d)
>   #define ACPI_20_TABLE_GUID			EFI_GUID(0x8868e871, 0xe4f1, 0x11d3,  0xbc, 0x22, 0x00, 0x80, 0xc7, 0x3c, 0x88, 0x81)
>   #define EFI_CC_BLOB_GUID			EFI_GUID(0x067b1f5f, 0xcf26, 0x44c5, 0x85, 0x54, 0x93, 0xd7, 0x77, 0x91, 0x2d, 0x42)
> +#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID	EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9,  0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
>   
>   #define EFI32_LOADER_SIGNATURE	"EL32"
>   #define EFI64_LOADER_SIGNATURE	"EL64"
> @@ -105,6 +106,14 @@ struct efi_setup_data {
>   	u64 reserved[8];
>   };
>   
> +struct efi_unaccepted_memory {
> +	u32 version;
> +	u32 unit_size;
> +	u64 phys_base;
> +	u64 size;
> +	unsigned long bitmap[];
> +};
> +
>   static inline int efi_guidcmp (efi_guid_t left, efi_guid_t right)
>   {
>   	return memcmp(&left, &right, sizeof (efi_guid_t));
> diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
> index a4308d077885..0108c97399a5 100644
> --- a/arch/x86/boot/compressed/mem.c
> +++ b/arch/x86/boot/compressed/mem.c
> @@ -1,8 +1,7 @@
>   // SPDX-License-Identifier: GPL-2.0-only
>   
> -#include "../cpuflags.h"
> -#include "../string.h"
>   #include "error.h"
> +#include "misc.h"
>   #include "tdx.h"
>   #include <asm/shared/tdx.h>
>   
> @@ -40,3 +39,25 @@ void arch_accept_memory(phys_addr_t start, phys_addr_t end)
>   	else
>   		error("Cannot accept memory: unknown platform\n");
>   }
> +
> +void init_unaccepted_memory(void)
> +{
> +	guid_t guid =  LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
> +	struct efi_unaccepted_memory *unaccepted_table;
> +	unsigned long cfg_table_pa;
> +	unsigned int cfg_table_len;
> +	int ret;
> +
> +	ret = efi_get_conf_table(boot_params, &cfg_table_pa, &cfg_table_len);
> +	if (ret)
> +		error("EFI config table not found.");
> +
> +	unaccepted_table = (void *)efi_find_vendor_table(boot_params,
> +							 cfg_table_pa,
> +							 cfg_table_len,
> +							 guid);
> +	if (unaccepted_table->version != 1)
> +		error("Unknown version of unaccepted memory table\n");
> +
> +	set_unaccepted_table(unaccepted_table);
> +}
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index eb8df0d4ad51..36535a3753f5 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -458,6 +458,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
>   
>   	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
>   		debug_putstr("Accepting memory... ");
> +		init_unaccepted_memory();
>   		accept_memory(__pa(output), __pa(output) + needed_size);
>   	}
>   
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 9663d1839f54..e1a0b49e0ed2 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -247,10 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
>   }
>   #endif /* CONFIG_EFI */
>   
> -#ifdef CONFIG_UNACCEPTED_MEMORY
> +void init_unaccepted_memory(void);
> +
> +/* Implemented in EFI stub */
> +void set_unaccepted_table(struct efi_unaccepted_memory *table);
>   void accept_memory(phys_addr_t start, phys_addr_t end);
> -#else
> -static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
> -#endif
>   
>   #endif /* BOOT_COMPRESSED_MISC_H */
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> index f4642c4f25dd..fd6a3195c68f 100644
> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -6,6 +6,18 @@
>   
>   static struct efi_unaccepted_memory *unaccepted_table;
>   
> +/*
> + * Decompressor needs to initialize the variable to cover cases when the table
> + * is not allocated by EFI stub or EFI stub copied the kernel image with
> + * efi_relocate_kernel() before the variable is set.
> + *
> + * It must be called before the first usage of accept_memory() by decompressor.
> + */
> +void set_unaccepted_table(struct efi_unaccepted_memory *table)
> +{
> +	unaccepted_table = table;
> +}
> +
>   efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
>   					struct efi_boot_memmap *map)
>   {

end of thread, other threads:[~2023-05-17 18:50 UTC | newest]

Thread overview: 38+ messages
2023-05-13 22:04 [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Kirill A. Shutemov
2023-05-13 22:04 ` [PATCHv11 1/9] mm: Add " Kirill A. Shutemov
2023-05-16 19:44   ` Tom Lendacky
2023-05-16 21:32     ` Kirill A. Shutemov
2023-05-13 22:04 ` [PATCHv11 2/9] efi/x86: Get full memory map in allocate_e820() Kirill A. Shutemov
2023-05-16 19:52   ` Tom Lendacky
2023-05-13 22:04 ` [PATCHv11 3/9] efi/libstub: Implement support for unaccepted memory Kirill A. Shutemov
2023-05-14  5:08   ` Mika Penttilä
2023-05-14 21:13     ` Kirill A. Shutemov
2023-05-16 18:01       ` Ard Biesheuvel
2023-05-16 18:06   ` Ard Biesheuvel
2023-05-13 22:04 ` [PATCHv11 4/9] x86/boot/compressed: Handle " Kirill A. Shutemov
2023-05-16 17:09   ` Liam Merwick
2023-05-17 15:52   ` Tom Lendacky
2023-05-13 22:04 ` [PATCHv11 5/9] efi: Provide helpers for " Kirill A. Shutemov
2023-05-16 12:06   ` [PATCHv11.1 5/9] efi: Add unaccepted memory support Kirill A. Shutemov
2023-05-16 17:25     ` Ard Biesheuvel
2023-05-17 15:58     ` Tom Lendacky
2023-05-13 22:04 ` [PATCHv11 6/9] efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory Kirill A. Shutemov
2023-05-16 18:08   ` Ard Biesheuvel
2023-05-16 18:27     ` Dave Hansen
2023-05-16 18:35       ` Ard Biesheuvel
2023-05-16 19:15         ` Kirill A. Shutemov
2023-05-16 20:03         ` Dave Hansen
2023-05-16 21:52           ` Kirill A. Shutemov
2023-05-16 21:59             ` Dave Hansen
2023-05-16 22:15               ` Ard Biesheuvel
2023-05-16 18:33     ` Kirill A. Shutemov
2023-05-16 23:04       ` Dave Hansen
2023-05-17 16:07   ` Tom Lendacky
2023-05-13 22:04 ` [PATCHv11 7/9] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Kirill A. Shutemov
2023-05-13 22:04 ` [PATCHv11 8/9] x86/tdx: Refactor try_accept_one() Kirill A. Shutemov
2023-05-13 22:04 ` [PATCHv11 9/9] x86/tdx: Add unaccepted memory support Kirill A. Shutemov
2023-05-16 22:41 ` [PATCHv11 0/9] mm, x86/cc, efi: Implement support for unaccepted memory Tom Lendacky
2023-05-16 23:22   ` Kirill A. Shutemov
2023-05-17 14:32     ` Tom Lendacky
2023-05-17 18:36       ` Kirill A. Shutemov
2023-05-17 18:50         ` Tom Lendacky
