linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] riscv: Memory Hot(Un)Plug support
@ 2023-05-12 14:57 Björn Töpel
  2023-05-12 14:57 ` [PATCH 1/7] riscv: mm: Pre-allocate PGD leaves to avoid synchronization Björn Töpel
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

Memory Hot(Un)Plug support for the RISC-V port
==============================================

Introduction
------------

To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series attempts to add memory hot(un)plug support for the RISC-V
Linux port.

I'm sending the series as a v1, but it's borderline RFC. It definitely
needs more testing time, but it would be nice to get some early input.

Implementation
--------------

From an arch perspective, a couple of callbacks need to be
implemented to support hot plugging:

arch_add_memory()
This callback is responsible for updating the linear/direct map, and
for calling into the generic memory hot plugging code via __add_pages().

arch_remove_memory()
In this callback the linear/direct map is torn down.

vmemmap_free()
The function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is in use), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used -- the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used. Note that RISC-V
populates the vmemmap using vmemmap_populate_basepages(), so currently
no hugepages are used for the backing store.
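
As a rough illustration of the ordering arch_add_memory() has to follow,
here is a user-space model (not kernel code). map_range(), unmap_range()
and generic_add_pages() are hypothetical stand-ins for the direct map
setup, its teardown, and __add_pages(). The only point is that the
mapping is rolled back if the generic step fails.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct model {
	uint64_t mapped_bytes;	/* size of the modelled direct map */
	bool fail_generic_add;	/* force the generic step to fail */
};

static void map_range(struct model *m, uint64_t size)
{
	m->mapped_bytes += size;
}

static void unmap_range(struct model *m, uint64_t size)
{
	m->mapped_bytes -= size;
}

/* Stand-in for __add_pages(): returns 0 on success, negative on error. */
static int generic_add_pages(const struct model *m)
{
	return m->fail_generic_add ? -1 : 0;
}

/* Map first, then hand over to the generic code; undo the map on error. */
static int model_add_memory(struct model *m, uint64_t size)
{
	int ret;

	map_range(m, size);
	ret = generic_add_pages(m);
	if (ret) {
		unmap_range(m, size);
		return ret;
	}
	return 0;
}
```

A failed add leaves mapped_bytes exactly where it was before the call,
which is the property the real callback needs for its direct map.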

The page table unmap/teardown functions are heavily based on (copied
from!) the x86 tree. The same remove_pgd_mapping() is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.
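
The shared teardown with a flag can be pictured with a small user-space
model (a sketch under assumptions, not the code in this series). The
one-level model_table and model_remove_mapping() are made up for
illustration. The flag plays the role of the vmemmap/direct map
distinction: entries are always cleared, backing is released only for
vmemmap.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ENTRIES 8

/* Hypothetical one-level "page table": true means the slot is mapped. */
struct model_table {
	bool present[MODEL_ENTRIES];
	size_t backing_freed;	/* how many backing pages were released */
};

/*
 * Clear every present slot in [start, end). The backing page is
 * released only in the vmemmap case; for the direct map only the
 * mapping goes away.
 */
static size_t model_remove_mapping(struct model_table *t, size_t start,
				   size_t end, bool is_vmemmap)
{
	size_t cleared = 0;

	for (size_t i = start; i < end && i < MODEL_ENTRIES; i++) {
		if (!t->present[i])
			continue;
		t->present[i] = false;
		cleared++;
		if (is_vmemmap)
			t->backing_freed++;
	}
	return cleared;
}
```

Slots that are not present are skipped, which mirrors the "continue if
not present" checks a page-table walk needs at each level.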

On RISC-V, the PGD level kernel mappings need to be synchronized with
all page-tables (e.g. via sync_kernel_mappings()). Synchronization
involves special care, like locking. Instead, this patch series takes
a different approach (introduced by Jörg Rödel in the x86 tree):
pre-allocate the PGD leaves (P4D, PUD, or PMD depending on the paging
setup) at mem_init(), for vmemmap and the direct map.
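
To make the "one allocation per PGD entry" idea concrete, here is a
user-space sketch (not kernel code) that counts how many top-level
slots a range touches when the walk jumps to the next slot boundary on
every step. model_align_up() and model_count_pgd_slots() are
hypothetical names, and slot_size stands in for PGDIR_SIZE.

```c
#include <stdint.h>

/* Round x up to a multiple of a, where a is a power of two. */
static uint64_t model_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Count how many top-level slots a walk over [start, end) touches when
 * it advances to the next slot boundary on every step. slot_size is a
 * stand-in for PGDIR_SIZE and must be a power of two.
 */
static unsigned int model_count_pgd_slots(uint64_t start, uint64_t end,
					  uint64_t slot_size)
{
	unsigned int slots = 0;
	uint64_t addr = start;

	while (addr < end) {
		uint64_t next = model_align_up(addr + 1, slot_size);

		slots++;
		if (next <= addr)	/* wrapped around the address space */
			break;
		addr = next;
	}
	return slots;
}
```

With a 1 GiB slot size, a range from 0.5 GiB to 2.5 GiB touches three
slots, so three next-level tables would be pre-allocated for it in this
model.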

Pre-allocating the PGD leaves wastes some memory, but is only enabled
for CONFIG_MEMORY_HOTPLUG. The potentially unused pages amount to
~128 * 4K.
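
For scale, the arithmetic behind that figure is spelled out below. This
is a trivial user-space helper with a made-up name; 128 and 4096 are
the approximate page count and the page size quoted above.

```c
#include <stdint.h>

/* Upper bound in bytes for a number of pre-allocated page-table pages. */
static uint64_t model_prealloc_bytes(uint64_t pages, uint64_t page_size)
{
	return pages * page_size;
}
```

model_prealloc_bytes(128, 4096) gives 524288 bytes, i.e. about 512 KiB
in the worst case.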

Patch 1: Preparation for hotplugging support, by pre-allocating the
         PGD leaves.

Patch 2: Changes the __init attribute to __meminit, so that the
         functions are not removed after init. __meminit keeps the
         functions after init, if memory hotplugging is enabled for
         the build.
         
Patch 3: Refactor the direct map setup, so it can be used for hot add.

Patch 4: The actual add/remove code. Mostly a page-table-walk
         exercise.

Patch 5: Turn on the arch support in Kconfig

Patch 6: Now that memory hotplugging is enabled, make virtio-mem
         usable for RISC-V
         
Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
         need for vmalloc faulting.
         
RFC
---

 * TLB flushes. The current series uses Big Hammer flush-it-all.
 * Pre-allocation vs explicit syncs

Testing
-------

ACPI support is still in the making for RISC-V, so tests that involve
CXL and similar fanciness are currently not possible. Virtio-mem,
however, works without proper ACPI support. In order to try this out
in Qemu, some additional patches for Qemu are needed:

 * Enable virtio-mem for RISC-V
 * Add proper hotplug support for virtio-mem
 
The patch for Qemu is commit 5d90a7ef1bc0
("hw/riscv/virt: Support for virtio-mem-pci"), and can be found here:

  https://github.com/bjoto/qemu/tree/riscv-virtio-mem

I will try to upstream that work in parallel with this.
  
Thanks to David Hildenbrand for valuable input for the Qemu side of
things.

The series is based on the RISC-V fixes tree
  https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes


Thanks,
Björn


Björn Töpel (7):
  riscv: mm: Pre-allocate PGD leaves to avoid synchronization
  riscv: mm: Change attribute from __init to __meminit for page
    functions
  riscv: mm: Refactor create_linear_mapping_range() for hot add
  riscv: mm: Add memory hot add/remove support
  riscv: Enable memory hot add/remove arch kbuild support
  virtio-mem: Enable virtio-mem for RISC-V
  riscv: mm: Pre-allocate vmalloc PGD leaves

 arch/riscv/Kconfig               |   2 +
 arch/riscv/include/asm/kasan.h   |   4 +-
 arch/riscv/include/asm/mmu.h     |   2 +-
 arch/riscv/include/asm/pgtable.h |   2 +-
 arch/riscv/mm/fault.c            |   7 +-
 arch/riscv/mm/init.c             | 387 ++++++++++++++++++++++++++++---
 drivers/virtio/Kconfig           |   2 +-
 7 files changed, 364 insertions(+), 42 deletions(-)


base-commit: 3b90b09af5be42491a8a74a549318cfa265b3029
-- 
2.39.2



* [PATCH 1/7] riscv: mm: Pre-allocate PGD leaves to avoid synchronization
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
@ 2023-05-12 14:57 ` Björn Töpel
  2023-05-12 14:57 ` [PATCH 2/7] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

The RISC-V port copies PGD from init_mm to all userland page-tables,
which means that when the PGD level of the init_mm table is changed,
other page-tables have to be updated.

One way to avoid synchronizing page-tables is to pre-allocate the
pages that are copied (need to be synchronized). For memory
hotplug builds, prefer to waste some pages, rather than do
explicit synchronization.

Prepare the RISC-V port for memory add/remove, by getting rid of PGD
synchronization. Pre-allocate vmemmap, and direct map pages. This will
waste roughly 128 4K pages.

Note that this is only done for memory hotplug enabled configurations.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/include/asm/kasan.h |  4 +-
 arch/riscv/mm/init.c           | 86 ++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index 0b85e363e778..e6a0071bdb56 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -6,8 +6,6 @@
 
 #ifndef __ASSEMBLY__
 
-#ifdef CONFIG_KASAN
-
 /*
  * The following comment was copied from arm64:
  * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
@@ -34,6 +32,8 @@
  */
 #define KASAN_SHADOW_START	((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
 #define KASAN_SHADOW_END	MODULES_LOWEST_VADDR
+
+#ifdef CONFIG_KASAN
 #define KASAN_SHADOW_OFFSET	_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
 
 void kasan_init(void);
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 747e5b1ef02d..d2595cc33a1c 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -31,6 +31,7 @@
 #include <asm/io.h>
 #include <asm/ptdump.h>
 #include <asm/numa.h>
+#include <asm/kasan.h>
 
 #include "../kernel/head.h"
 
@@ -156,6 +157,90 @@ static void __init print_vm_layout(void)
 static void print_vm_layout(void) { }
 #endif /* CONFIG_DEBUG_VM */
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Pre-allocates page-table pages for a specific area in the kernel
+ * page-table. Only the level which needs to be synchronized between
+ * all page-tables is allocated because the synchronization can be
+ * expensive.
+ */
+static void __init preallocate_pgd_pages_range(unsigned long start, unsigned long end,
+					       const char *area)
+{
+	unsigned long addr;
+	const char *lvl;
+
+	for (addr = start; addr < end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
+		pgd_t *pgd = pgd_offset_k(addr);
+		p4d_t *p4d;
+		pud_t *pud;
+		pmd_t *pmd;
+
+		lvl = "p4d";
+		p4d = p4d_alloc(&init_mm, pgd, addr);
+		if (!p4d)
+			goto failed;
+
+		if (pgtable_l5_enabled)
+			continue;
+
+		/*
+		 * The goal here is to allocate all possibly required
+		 * hardware page tables pointed to by the top hardware
+		 * level.
+		 *
+		 * On 4-level systems, the P4D layer is folded away
+		 * and the above code does no preallocation.  Below,
+		 * go down to the pud _software_ level to ensure the
+		 * second hardware level is allocated on 4-level
+		 * systems too.
+		 */
+		lvl = "pud";
+		pud = pud_alloc(&init_mm, p4d, addr);
+		if (!pud)
+			goto failed;
+
+		if (pgtable_l4_enabled)
+			continue;
+		/*
+		 * The goal here is to allocate all possibly required
+		 * hardware page tables pointed to by the top hardware
+		 * level.
+		 *
+		 * On 3-level systems, the PUD layer is folded away
+		 * and the above code does no preallocation.  Below,
+		 * go down to the pmd _software_ level to ensure the
+		 * second hardware level is allocated on 3-level
+		 * systems too.
+		 */
+		lvl = "pmd";
+		pmd = pmd_alloc(&init_mm, pud, addr);
+		if (!pmd)
+			goto failed;
+	}
+
+	return;
+
+failed:
+
+	/*
+	 * The pages have to be there now or they will be missing in
+	 * process page-tables later.
+	 */
+	panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
+}
+
+#define PAGE_END KASAN_SHADOW_START
+#endif
+
+static void __init prepare_memory_hotplug(void)
+{
+#ifdef CONFIG_MEMORY_HOTPLUG
+	preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, "vmemmap");
+	preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct map");
+#endif
+}
+
 void __init mem_init(void)
 {
 #ifdef CONFIG_FLATMEM
@@ -164,6 +249,7 @@ void __init mem_init(void)
 
 	swiotlb_init(max_pfn > PFN_DOWN(dma32_phys_limit), SWIOTLB_VERBOSE);
 	memblock_free_all();
+	prepare_memory_hotplug();
 
 	print_vm_layout();
 }
-- 
2.39.2



* [PATCH 2/7] riscv: mm: Change attribute from __init to __meminit for page functions
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
  2023-05-12 14:57 ` [PATCH 1/7] riscv: mm: Pre-allocate PGD leaves to avoid synchronization Björn Töpel
@ 2023-05-12 14:57 ` Björn Töpel
  2023-05-12 14:57 ` [PATCH 3/7] riscv: mm: Refactor create_linear_mapping_range() for hot add Björn Töpel
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

Prepare for memory hot add/remove support by changing from __init to
__meminit for the page-table functions that are used by the upcoming
arch specific callbacks.

Changing the __init attribute to __meminit prevents the functions from
being removed after init. __meminit keeps the functions after init, if
memory hotplugging is enabled for the build.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/include/asm/mmu.h     |  2 +-
 arch/riscv/include/asm/pgtable.h |  2 +-
 arch/riscv/mm/init.c             | 49 ++++++++++++++------------------
 3 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
index 0099dc116168..9e5d4f37ba2e 100644
--- a/arch/riscv/include/asm/mmu.h
+++ b/arch/riscv/include/asm/mmu.h
@@ -22,7 +22,7 @@ typedef struct {
 #endif
 } mm_context_t;
 
-void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
+void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
 			       phys_addr_t sz, pgprot_t prot);
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 2258b27173b0..a4cdcb689959 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -147,7 +147,7 @@ struct pt_alloc_ops {
 #endif
 };
 
-extern struct pt_alloc_ops pt_ops __initdata;
+extern struct pt_alloc_ops pt_ops __meminitdata;
 
 #ifdef CONFIG_MMU
 /* Number of PGD entries that a user-mode program can use */
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index d2595cc33a1c..e974ff6ef036 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -356,7 +356,7 @@ static void __init setup_bootmem(void)
 }
 
 #ifdef CONFIG_MMU
-struct pt_alloc_ops pt_ops __initdata;
+struct pt_alloc_ops pt_ops __meminitdata;
 
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
@@ -418,7 +418,7 @@ static inline pte_t *__init get_pte_virt_fixmap(phys_addr_t pa)
 	return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
 }
 
-static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
+static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
 {
 	return (pte_t *) __va(pa);
 }
@@ -437,7 +437,7 @@ static inline phys_addr_t __init alloc_pte_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t __init alloc_pte_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
 {
 	unsigned long vaddr;
 
@@ -447,9 +447,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
 	return __pa(vaddr);
 }
 
-static void __init create_pte_mapping(pte_t *ptep,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
 	uintptr_t pte_idx = pte_index(va);
 
@@ -503,7 +502,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
 	return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
 }
 
-static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
+static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
 {
 	return (pmd_t *) __va(pa);
 }
@@ -520,7 +519,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t __init alloc_pmd_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
 {
 	unsigned long vaddr;
 
@@ -530,9 +529,8 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
 	return __pa(vaddr);
 }
 
-static void __init create_pmd_mapping(pmd_t *pmdp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pmd_mapping(pmd_t *pmdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
 	pte_t *ptep;
 	phys_addr_t pte_phys;
@@ -568,7 +566,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
 	return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
 }
 
-static pud_t *__init get_pud_virt_late(phys_addr_t pa)
+static pud_t *__meminit get_pud_virt_late(phys_addr_t pa)
 {
 	return (pud_t *)__va(pa);
 }
@@ -586,7 +584,7 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t alloc_pud_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pud_late(uintptr_t va)
 {
 	unsigned long vaddr;
 
@@ -606,7 +604,7 @@ static p4d_t *__init get_p4d_virt_fixmap(phys_addr_t pa)
 	return (p4d_t *)set_fixmap_offset(FIX_P4D, pa);
 }
 
-static p4d_t *__init get_p4d_virt_late(phys_addr_t pa)
+static p4d_t *__meminit get_p4d_virt_late(phys_addr_t pa)
 {
 	return (p4d_t *)__va(pa);
 }
@@ -624,7 +622,7 @@ static phys_addr_t __init alloc_p4d_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t alloc_p4d_late(uintptr_t va)
+static phys_addr_t __meminit alloc_p4d_late(uintptr_t va)
 {
 	unsigned long vaddr;
 
@@ -633,9 +631,8 @@ static phys_addr_t alloc_p4d_late(uintptr_t va)
 	return __pa(vaddr);
 }
 
-static void __init create_pud_mapping(pud_t *pudp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pud_mapping(pud_t *pudp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
 	pmd_t *nextp;
 	phys_addr_t next_phys;
@@ -660,9 +657,8 @@ static void __init create_pud_mapping(pud_t *pudp,
 	create_pmd_mapping(nextp, va, pa, sz, prot);
 }
 
-static void __init create_p4d_mapping(p4d_t *p4dp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_p4d_mapping(p4d_t *p4dp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
 	pud_t *nextp;
 	phys_addr_t next_phys;
@@ -718,9 +714,8 @@ static void __init create_p4d_mapping(p4d_t *p4dp,
 #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot) do {} while(0)
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-void __init create_pgd_mapping(pgd_t *pgdp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+				  pgprot_t prot)
 {
 	pgd_next_t *nextp;
 	phys_addr_t next_phys;
@@ -745,7 +740,7 @@ void __init create_pgd_mapping(pgd_t *pgdp,
 	create_pgd_next_mapping(nextp, va, pa, sz, prot);
 }
 
-static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
+static uintptr_t __meminit best_map_size(phys_addr_t base, phys_addr_t size)
 {
 	if (!(base & (PGDIR_SIZE - 1)) && size >= PGDIR_SIZE)
 		return PGDIR_SIZE;
@@ -778,7 +773,7 @@ asmlinkage void __init __copy_data(void)
 #endif
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
-static __init pgprot_t pgprot_from_va(uintptr_t va)
+static __meminit pgprot_t pgprot_from_va(uintptr_t va)
 {
 	if (is_va_kernel_text(va))
 		return PAGE_KERNEL_READ_EXEC;
@@ -805,7 +800,7 @@ void mark_rodata_ro(void)
 	debug_checkwx();
 }
 #else
-static __init pgprot_t pgprot_from_va(uintptr_t va)
+static __meminit pgprot_t pgprot_from_va(uintptr_t va)
 {
 	if (IS_ENABLED(CONFIG_64BIT) && !is_kernel_mapping(va))
 		return PAGE_KERNEL;
-- 
2.39.2



* [PATCH 3/7] riscv: mm: Refactor create_linear_mapping_range() for hot add
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
  2023-05-12 14:57 ` [PATCH 1/7] riscv: mm: Pre-allocate PGD leaves to avoid synchronization Björn Töpel
  2023-05-12 14:57 ` [PATCH 2/7] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
@ 2023-05-12 14:57 ` Björn Töpel
  2023-06-21 23:56   ` Palmer Dabbelt
  2023-05-12 14:57 ` [PATCH 4/7] riscv: mm: Add memory hot add/remove support Björn Töpel
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

Add a parameter to the direct map setup function, so it can be used in
arch_add_memory() later.
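
The "optional parameter with a fallback" shape can be sketched in user
space as follows. This is an illustration only: model_pgprot_t,
model_mhp_params, model_default_prot() and model_pick_prot() are
hypothetical stand-ins, and the protection values are arbitrary
numbers, not real PTE bits.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the kernel types carry more than this. */
typedef unsigned long model_pgprot_t;

struct model_mhp_params {
	model_pgprot_t pgprot;
};

/* Model of a per-address default, in the role of pgprot_from_va(). */
static model_pgprot_t model_default_prot(unsigned long va)
{
	return (va & 1UL) ? 0x5UL : 0x7UL;
}

/* Caller-supplied protection wins; NULL keeps the boot-time behaviour. */
static model_pgprot_t model_pick_prot(const struct model_mhp_params *params,
				      unsigned long va)
{
	return params ? params->pgprot : model_default_prot(va);
}
```

Existing boot-time callers pass NULL and see no change, while a hot add
path can supply its own protection bits.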

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/mm/init.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index e974ff6ef036..aea8ccb3f4ae 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1247,18 +1247,19 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 	pt_ops_set_fixmap();
 }
 
-static void __init create_linear_mapping_range(phys_addr_t start,
-					       phys_addr_t end)
+static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
+						  struct mhp_params *params)
 {
 	phys_addr_t pa;
 	uintptr_t va, map_size;
 
 	for (pa = start; pa < end; pa += map_size) {
+		pgprot_t pgprot;
+
 		va = (uintptr_t)__va(pa);
+		pgprot =  params ? params->pgprot : pgprot_from_va(va);
 		map_size = best_map_size(pa, end - pa);
-
-		create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
-				   pgprot_from_va(va));
+		create_pgd_mapping(swapper_pg_dir, va, pa, map_size, pgprot);
 	}
 }
 
@@ -1288,13 +1289,12 @@ static void __init create_linear_mapping_page_table(void)
 		if (end >= __pa(PAGE_OFFSET) + memory_limit)
 			end = __pa(PAGE_OFFSET) + memory_limit;
 
-		create_linear_mapping_range(start, end);
+		create_linear_mapping_range(start, end, NULL);
 	}
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
-	create_linear_mapping_range(ktext_start, ktext_start + ktext_size);
-	create_linear_mapping_range(krodata_start,
-				    krodata_start + krodata_size);
+	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, NULL);
+	create_linear_mapping_range(krodata_start, krodata_start + krodata_size, NULL);
 
 	memblock_clear_nomap(ktext_start,  ktext_size);
 	memblock_clear_nomap(krodata_start, krodata_size);
-- 
2.39.2



* [PATCH 4/7] riscv: mm: Add memory hot add/remove support
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (2 preceding siblings ...)
  2023-05-12 14:57 ` [PATCH 3/7] riscv: mm: Refactor create_linear_mapping_range() for hot add Björn Töpel
@ 2023-05-12 14:57 ` Björn Töpel
  2023-05-12 14:57 ` [PATCH 5/7] riscv: Enable memory hot add/remove arch kbuild support Björn Töpel
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

From an arch perspective, a couple of callbacks need to be
implemented to support hotplugging:

arch_add_memory() This callback is responsible for updating the
linear/direct map, and for calling into the generic memory hotplugging
code via __add_pages().

arch_remove_memory() In this callback the linear/direct map is torn
down.

vmemmap_free() The function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is in use), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used -- the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used. Note that RISC-V
populates the vmemmap using vmemmap_populate_basepages(), so currently
no hugepages are used for the backing store.

The page table unmap/teardown functions are heavily based on (copied
from!) the x86 tree. The same remove_pgd_mapping() is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/mm/init.c | 233 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 233 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index aea8ccb3f4ae..a468708d1e1c 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1444,3 +1444,236 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	return vmemmap_populate_basepages(start, end, node, NULL);
 }
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (!pte_none(*pte))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+	pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+	pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+	p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+					   struct vmem_altmap *altmap)
+{
+	if (altmap)
+		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+	else
+		free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pte_t *ptep, pte;
+
+	for (; addr < end; addr = next) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		ptep = pte_base + pte_index(addr);
+		pte = READ_ONCE(*ptep);
+
+		if (!pte_present(*ptep))
+			continue;
+
+		pte_clear(&init_mm, addr, ptep);
+		if (is_vmemmap)
+			free_vmemmap_storage(pte_page(pte), PAGE_SIZE, altmap);
+	}
+}
+
+static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pte_t *pte_base;
+	pmd_t *pmdp, pmd;
+
+	for (; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_base + pmd_index(addr);
+		pmd = READ_ONCE(*pmdp);
+
+		if (!pmd_present(pmd))
+			continue;
+
+		if (pmd_leaf(pmd)) {
+			pmd_clear(pmdp);
+			if (is_vmemmap)
+				free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, altmap);
+			continue;
+		}
+
+		pte_base = (pte_t *)pmd_page_vaddr(*pmdp);
+		remove_pte_mapping(pte_base, addr, next, is_vmemmap, altmap);
+		free_pte_table(pte_base, pmdp);
+	}
+}
+
+static void __meminit remove_pud_mapping(pud_t *pud_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pud_t *pudp, pud;
+	pmd_t *pmd_base;
+
+	for (; addr < end; addr = next) {
+		next = pud_addr_end(addr, end);
+		pudp = pud_base + pud_index(addr);
+		pud = READ_ONCE(*pudp);
+
+		if (!pud_present(pud))
+			continue;
+
+		if (pud_leaf(pud)) {
+			if (pgtable_l4_enabled) {
+				pud_clear(pudp);
+				if (is_vmemmap)
+					free_vmemmap_storage(pud_page(pud), PUD_SIZE, altmap);
+			}
+			continue;
+		}
+
+		pmd_base = pmd_offset(pudp, 0);
+		remove_pmd_mapping(pmd_base, addr, next, is_vmemmap, altmap);
+
+		if (pgtable_l4_enabled)
+			free_pmd_table(pmd_base, pudp);
+	}
+}
+
+static void __meminit remove_p4d_mapping(p4d_t *p4d_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	p4d_t *p4dp, p4d;
+	pud_t *pud_base;
+
+	for (; addr < end; addr = next) {
+		next = p4d_addr_end(addr, end);
+		p4dp = p4d_base + p4d_index(addr);
+		p4d = READ_ONCE(*p4dp);
+
+		if (!p4d_present(p4d))
+			continue;
+
+		if (p4d_leaf(p4d)) {
+			if (pgtable_l5_enabled) {
+				p4d_clear(p4dp);
+				if (is_vmemmap)
+					free_vmemmap_storage(p4d_page(p4d), P4D_SIZE, altmap);
+			}
+			continue;
+		}
+
+		pud_base = pud_offset(p4dp, 0);
+		remove_pud_mapping(pud_base, addr, next, is_vmemmap, altmap);
+
+		if (pgtable_l5_enabled)
+			free_pud_table(pud_base, p4dp);
+	}
+}
+
+static void __meminit remove_pgd_mapping(unsigned long va, unsigned long end, bool is_vmemmap,
+					 struct vmem_altmap *altmap)
+{
+	unsigned long addr, next;
+	p4d_t *p4d_base;
+	pgd_t *pgd;
+
+	for (addr = va; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+		pgd = pgd_offset_k(addr);
+
+		if (!pgd_present(*pgd))
+			continue;
+
+		if (pgd_leaf(*pgd))
+			continue;
+
+		p4d_base = p4d_offset(pgd, 0);
+		remove_p4d_mapping(p4d_base, addr, next, is_vmemmap, altmap);
+	}
+
+	flush_tlb_all();
+}
+
+static void __meminit remove_linear_mapping(phys_addr_t start, u64 size)
+{
+	unsigned long va = (unsigned long)__va(start);
+	unsigned long end = (unsigned long)__va(start + size);
+
+	remove_pgd_mapping(va, end, false, NULL);
+}
+
+int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
+{
+	int ret;
+
+	create_linear_mapping_range(start, start + size, params);
+	flush_tlb_all();
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
+	if (ret) {
+		remove_linear_mapping(start, size);
+		return ret;
+	}
+
+	max_pfn = PFN_UP(start + size);
+	max_low_pfn = max_pfn;
+	return 0;
+}
+
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+{
+	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
+	remove_linear_mapping(start, size);
+}
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
+{
+	remove_pgd_mapping(start, end, true, altmap);
+}
+#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+#endif /* CONFIG_MEMORY_HOTPLUG */
-- 
2.39.2



* [PATCH 5/7] riscv: Enable memory hot add/remove arch kbuild support
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (3 preceding siblings ...)
  2023-05-12 14:57 ` [PATCH 4/7] riscv: mm: Add memory hot add/remove support Björn Töpel
@ 2023-05-12 14:57 ` Björn Töpel
  2023-05-12 14:57 ` [PATCH 6/7] virtio-mem: Enable virtio-mem for RISC-V Björn Töpel
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
RISC-V.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 348c0fa1fc8c..81b3f188f396 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -14,6 +14,8 @@ config RISCV
 	def_bool y
 	select ARCH_DMA_DEFAULT_COHERENT
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
+	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
+	select ARCH_ENABLE_MEMORY_HOTREMOVE if SPARSEMEM && 64BIT && MMU
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
 	select ARCH_HAS_BINFMT_FLAT
-- 
2.39.2



* [PATCH 6/7] virtio-mem: Enable virtio-mem for RISC-V
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (4 preceding siblings ...)
  2023-05-12 14:57 ` [PATCH 5/7] riscv: Enable memory hot add/remove arch kbuild support Björn Töpel
@ 2023-05-12 14:57 ` Björn Töpel
  2023-05-12 14:57 ` [PATCH 7/7] riscv: mm: Pre-allocate vmalloc PGD leaves Björn Töpel
  2023-05-17 13:49 ` [PATCH 0/7] riscv: Memory Hot(Un)Plug support David Hildenbrand
  7 siblings, 0 replies; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

Now that RISC-V has memory hot add/remove support, virtio-mem can be
used on the platform.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 drivers/virtio/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 0a53a61231c2..358e79ece169 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -117,7 +117,7 @@ config VIRTIO_BALLOON
 
 config VIRTIO_MEM
 	tristate "Virtio mem driver"
-	depends on X86_64 || ARM64
+	depends on X86_64 || ARM64 || RISCV
 	depends on VIRTIO
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 7/7] riscv: mm: Pre-allocate vmalloc PGD leaves
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (5 preceding siblings ...)
  2023-05-12 14:57 ` [PATCH 6/7] virtio-mem: Enable virtio-mem for RISC-V Björn Töpel
@ 2023-05-12 14:57 ` Björn Töpel
  2023-05-17 13:49 ` [PATCH 0/7] riscv: Memory Hot(Un)Plug support David Hildenbrand
  7 siblings, 0 replies; 14+ messages in thread
From: Björn Töpel @ 2023-05-12 14:57 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, David Hildenbrand,
	Oscar Salvador, virtualization, linux, Alexandre Ghiti

From: Björn Töpel <bjorn@rivosinc.com>

Instead of relying on vmalloc_fault() to synchronize the page-tables,
pre-allocate the PGD leaves of the vmalloc area. This is only done if
memory hotplug is enabled in the build.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/mm/fault.c | 7 ++++++-
 arch/riscv/mm/init.c  | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)
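
A toy userspace model of the idea in this patch, not kernel code: if
every top-level slot covering a region is populated in the reference
table before any other address space is created, later copies never
differ from the reference for that region, so there is nothing to
fault in. All names below (toy_pgd, toy_preallocate(), and so on) are
hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PTRS_PER_PGD 512	/* toy value, not tied to a paging mode */

/* Toy PGD: a non-zero slot means "points to a next-level table". */
struct toy_pgd {
	unsigned long entry[PTRS_PER_PGD];
};

/* Stands in for the reference table, init_mm.pgd. */
static struct toy_pgd reference_pgd;

/* Boot-time step: make sure every slot in [first, last] is populated. */
static void toy_preallocate(size_t first, size_t last)
{
	static unsigned long next_table = 0x1000;	/* fake table address */

	for (size_t i = first; i <= last && i < PTRS_PER_PGD; i++) {
		if (!reference_pgd.entry[i]) {
			reference_pgd.entry[i] = next_table;
			next_table += 0x1000;
		}
	}
}

/* New address spaces start from a copy of the reference table. */
static void toy_new_pgd(struct toy_pgd *pgd)
{
	memcpy(pgd, &reference_pgd, sizeof(*pgd));
}

/* A copy is stale for a slot if it differs from the reference. */
static bool toy_needs_sync(const struct toy_pgd *pgd, size_t idx)
{
	return pgd->entry[idx] != reference_pgd.entry[idx];
}
```

In this model, slots filled by toy_preallocate() before toy_new_pgd()
never need syncing, while a slot populated afterwards does, which is
the case vmalloc_fault() exists to handle.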

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 8685f85a7474..b61e279acd50 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -233,12 +233,17 @@ void handle_page_fault(struct pt_regs *regs)
 	 * Fault-in kernel-space virtual memory on-demand.
 	 * The 'reference' page table is init_mm.pgd.
 	 *
+	 * For memory hotplug enabled systems, the PGD entries are
+	 * pre-allocated, which avoids the need to synchronize
+	 * pgd/fault-in.
+	 *
 	 * NOTE! We MUST NOT take any locks for this case. We may
 	 * be in an interrupt or a critical region, and should
 	 * only copy the information from the master page table,
 	 * nothing more.
 	 */
-	if (unlikely((addr >= VMALLOC_START) && (addr < VMALLOC_END))) {
+	if (unlikely(!IS_ENABLED(CONFIG_MEMORY_HOTPLUG) &&
+		     (addr >= VMALLOC_START) && (addr < VMALLOC_END))) {
 		vmalloc_fault(regs, code, addr);
 		return;
 	}
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index a468708d1e1c..fd5a6d3fe182 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -236,6 +236,7 @@ static void __init preallocate_pgd_pages_range(unsigned long start, unsigned lon
 static void __init prepare_memory_hotplug(void)
 {
 #ifdef CONFIG_MEMORY_HOTPLUG
+	preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
 	preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, "vmemmap");
 	preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct map");
 #endif
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support
  2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (6 preceding siblings ...)
  2023-05-12 14:57 ` [PATCH 7/7] riscv: mm: Pre-allocate vmalloc PGD leaves Björn Töpel
@ 2023-05-17 13:49 ` David Hildenbrand
  2023-05-17 18:53   ` Björn Töpel
  7 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2023-05-17 13:49 UTC (permalink / raw)
  To: Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, Oscar Salvador,
	virtualization, linux, Alexandre Ghiti

On 12.05.23 16:57, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Memory Hot(Un)Plug support for the RISC-V port
> ==============================================
> 
> Introduction
> ------------
> 
> To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
> hot(un)plug allows for increasing and decreasing the size of physical
> memory available to a machine at runtime."
> 
> This series attempts to add memory hot(un)plug support for the RISC-V
> Linux port.
> 
> I'm sending the series as a v1, but it's borderline RFC. It definitely
> needs more testing time, but it would be nice with some early input.
> 
> Implementation
> --------------
> 
> From an arch perspective, a couple of callbacks need to be
> implemented to support hot plugging:
> 
> arch_add_memory()
> This callback is responsible for updating the linear/direct map, and
> calls into the generic memory hotplug code via __add_pages().
> 
> arch_remove_memory()
> In this callback the linear/direct map is torn down.
> 
> vmemmap_free()
> The function tears down the vmemmap mappings (if
> CONFIG_SPARSEMEM_VMEMMAP is in-use), and also deallocates the backing
> vmemmap pages. Note that for persistent memory, an alternative
> allocator for the backing pages can be used -- the vmem_altmap. This
> means that when the backing pages are cleared, extra care is needed so
> that the correct deallocation method is used. Note that RISC-V
> populates the vmemmap using vmemmap_populate_basepages(), so currently
> no hugepages are used for the backing store.
> 
> The page table unmap/teardown functions are heavily based on (copied
> from!) the x86 tree. The same remove_pgd_mapping() is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
> 
> On RISC-V, the PGD-level kernel mappings need to be synchronized with
> all page tables (e.g. via sync_kernel_mappings()). Synchronization
> requires special care, like locking. Instead, this patch series takes
> a different approach (introduced by Jörg Rödel in the x86 tree):
> pre-allocate the PGD leaves (P4D, PUD, or PMD depending on the paging
> setup) at mem_init(), for vmemmap and the direct map.
> 
> Pre-allocating the PGD leaves wastes some memory, but is only enabled
> for CONFIG_MEMORY_HOTPLUG. The number of potentially unused pages is
> ~128 * 4K.
> 
> Patch 1: Preparation for hotplugging support, by pre-allocating the
>           PGD leaves.
> 
> Patch 2: Changes the __init attribute to __meminit, so that the
>           functions are not discarded after init. __meminit keeps the
>           functions after init, if memory hotplugging is enabled for
>           the build.
>           
> Patch 3: Refactor the direct map setup, so it can be used for hot add.
> 
> Patch 4: The actual add/remove code. Mostly a page-table-walk
>           exercise.
> 
> Patch 5: Turn on the arch support in Kconfig
> 
> Patch 6: Now that memory hotplugging is enabled, make virtio-mem
>           usable for RISC-V
>           
> Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
>           need for vmalloc faulting.
>           
> RFC
> ---
> 
>   * TLB flushes. The current series uses Big Hammer flush-it-all.
>   * Pre-allocation vs explicit syncs
> 
> Testing
> -------
> 
> ACPI support is still in the making for RISC-V, so tests that involve
> CXL and similar fanciness are currently not possible. Virtio-mem,
> however, works without proper ACPI support. In order to try this out
> in Qemu, some additional patches for Qemu are needed:
> 
>   * Enable virtio-mem for RISC-V
>   * Add proper hotplug support for virtio-mem
>   
> The patch for Qemu is commit 5d90a7ef1bc0
> ("hw/riscv/virt: Support for virtio-mem-pci"), and can be found here:
> 
>    https://github.com/bjoto/qemu/tree/riscv-virtio-mem
> 
> I will try to upstream that work in parallel with this.
>    
> Thanks to David Hildenbrand for valuable input for the Qemu side of
> things.
> 
> The series is based on the RISC-V fixes tree
>    https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes
> 

Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

What is the memory section size (which also implies the memory block
size)? This determines the minimum DIMM granularity and the high-level
granularity in which virtio-mem adds memory.

What is the pageblock size, implying the minimum granularity that 
virtio-mem can operate on?

On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum 
physical address where we can see memory getting hotplugged. [1] From 
that, we can derive the "max_possible_pfn" and prepare the kernel 
virtual memory layout (especially the direct map).

Is something similar required on RISC-V? On s390x, I'm planning on 
adding a paravirtualized mechanism to detect where memory devices might 
be located. (I had a running RFC, but was distracted by all other kinds 
of stuff)


[1] https://virtio-mem.gitlab.io/developer-guide.html

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support
  2023-05-17 13:49 ` [PATCH 0/7] riscv: Memory Hot(Un)Plug support David Hildenbrand
@ 2023-05-17 18:53   ` Björn Töpel
  2023-05-21  9:15     ` Björn Töpel
  0 siblings, 1 reply; 14+ messages in thread
From: Björn Töpel @ 2023-05-17 18:53 UTC (permalink / raw)
  To: David Hildenbrand, Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv
  Cc: Björn Töpel, linux-kernel, linux-mm, Oscar Salvador,
	virtualization, linux, Alexandre Ghiti

David Hildenbrand <david@redhat.com> writes:

> On 12.05.23 16:57, Björn Töpel wrote:
>> From: Björn Töpel <bjorn@rivosinc.com>
>> 
>> Memory Hot(Un)Plug support for the RISC-V port
>> ==============================================

[...]

>
> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
folks, and I've got tons of (offline) comments from Alex -- and with
your comments below some more details to figure out.

> What is the memory section size (which also implies the memory block
> size)? This determines the minimum DIMM granularity and the high-level
> granularity in which virtio-mem adds memory.

It's 128M (27 bits) -- (like arm64 and x86-64?).
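
For reference, the arithmetic behind that number can be sketched as
below. SECTION_SIZE_BITS is set to the 27 mentioned above; the helper
names are made up for illustration and do not exist in the kernel.

```c
#include <stdint.h>

#define SECTION_SIZE_BITS 27	/* value quoted above */

/* Size of one memory section in bytes: 2^27 = 128 MiB. */
static inline uint64_t section_size_bytes(void)
{
	return UINT64_C(1) << SECTION_SIZE_BITS;
}

/* Number of whole sections needed to cover 'bytes' (rounded up). */
static inline uint64_t sections_needed(uint64_t bytes)
{
	return (bytes + section_size_bytes() - 1) >> SECTION_SIZE_BITS;
}
```

So a 256 MiB hot-added region would span two sections, and anything
smaller than 128 MiB still occupies one.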

> What is the pageblock size, implying the minimum granularity that 
> virtio-mem can operate on?

Nothing special AFAIU; MAX_ORDER is 10, so PAGE_SIZE (4K) * 1024. Hmm, I
realize that I need to look into some more details of virtio-mem! :-)
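
A quick sketch of that calculation, assuming 4K pages (PAGE_SHIFT of
12) and MAX_ORDER of 10 as stated above. The actual pageblock order
may differ depending on the configuration, so treat this as the upper
bound being discussed rather than a definitive value; the helper names
are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4K pages, as stated above */
#define MAX_ORDER  10	/* as stated above */

/* Largest buddy block in bytes: PAGE_SIZE * 2^MAX_ORDER. */
static inline uint64_t max_order_block_bytes(void)
{
	return UINT64_C(1) << (PAGE_SHIFT + MAX_ORDER);
}

/* How many such blocks fit in one section of 2^section_bits bytes. */
static inline uint64_t blocks_per_section(unsigned int section_bits)
{
	return (UINT64_C(1) << section_bits) / max_order_block_bytes();
}
```

With these values the block size is 4 MiB, and a 128M section holds 32
of them.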

> On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum 
> physical address where we can see memory getting hotplugged. [1] From 
> that, we can derive the "max_possible_pfn" and prepare the kernel 
> virtual memory layout (especially the direct map).
>
> Is something similar required on RISC-V?

Yes! RISC-V is in the process of getting proper ACPI support. Thanks
for pointing me in these directions; food for thought that I'll
digest for the next version.


Cheers,
Björn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support
  2023-05-17 18:53   ` Björn Töpel
@ 2023-05-21  9:15     ` Björn Töpel
  2023-05-22  8:21       ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Björn Töpel @ 2023-05-21  9:15 UTC (permalink / raw)
  To: David Hildenbrand, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Anshuman Khandual
  Cc: Björn Töpel, linux-kernel, linux-mm, Oscar Salvador,
	virtualization, linux, Alexandre Ghiti

Hi David and Anshuman!

Björn Töpel <bjorn@kernel.org> writes:

> David Hildenbrand <david@redhat.com> writes:
>
>> On 12.05.23 16:57, Björn Töpel wrote:
>>> From: Björn Töpel <bjorn@rivosinc.com>
>>> 
>>> Memory Hot(Un)Plug support for the RISC-V port
>>> ==============================================
>
> [...]
>
>>
>> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:
>
> No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
> folks, and I've got tons of (offline) comments from Alex -- and with
> your comments below some more details to figure out.

One of the major issues with my v1 patch is around init_mm page table
synchronization, and that'll be part of the v2.

I've noticed there's quite a difference between x86-64 and arm64 in
terms of locking when updating (add/remove) the init_mm table. x86-64
uses the usual page table locking mechanisms (used by the generic
kernel functions), whereas arm64 does not.

How does arm64 manage to mix the "lock-less" updates (READ/WRITE_ONCE,
and fences in set_p?d+friends) with the generic kernel ones that use
the regular page table locking mechanism?

I'm obviously missing something about the locking rules for memory hot
add/remove... I've been reading the arm64 memory hot add/remove
series, but I'm none the wiser! ;-)


Björn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support
  2023-05-21  9:15     ` Björn Töpel
@ 2023-05-22  8:21       ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2023-05-22  8:21 UTC (permalink / raw)
  To: Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv, Anshuman Khandual
  Cc: Björn Töpel, linux-kernel, linux-mm, Oscar Salvador,
	virtualization, linux, Alexandre Ghiti

On 21.05.23 11:15, Björn Töpel wrote:
> Hi David and Anshuman!
> 
> Björn Töpel <bjorn@kernel.org> writes:
> 
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 12.05.23 16:57, Björn Töpel wrote:
>>>> From: Björn Töpel <bjorn@rivosinc.com>
>>>>
>>>> Memory Hot(Un)Plug support for the RISC-V port
>>>> ==============================================
>>
>> [...]
>>
>>>
>>> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:
>>
>> No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
>> folks, and I've got tons of (offline) comments from Alex -- and with
>> your comments below some more details to figure out.
> 
> One of the major issues with my v1 patch is around init_mm page table
> synchronization, and that'll be part of the v2.
> 
> I've noticed there's quite a difference between x86-64 and arm64 in
> terms of locking when updating (add/remove) the init_mm table. x86-64
> uses the usual page table locking mechanisms (used by the generic
> kernel functions), whereas arm64 does not.
> 
> How does arm64 manage to mix the "lock-less" updates (READ/WRITE_ONCE,
> and fences in set_p?d+friends) with the generic kernel ones that use
> the regular page table locking mechanism?
> 
> I'm obviously missing something about the locking rules for memory hot
> add/remove... I've been reading the arm64 memory hot add/remove
> series, but I'm none the wiser! ;-)

In general, memory hot(un)plug is serialized on a high level using the 
mem_hotplug_lock. For example, in pagemap_range() or in 
add_memory_resource(), we grab that lock in write mode. So we'll never 
see memory getting added/removed concurrently from the direct map.

From what I recall, the locking on the arch level is required for 
concurrent (direct mapping) page table modifications that target virtual 
address ranges adjacent to the ranges we hot(un)plug:
CONFIG_ARCH_HAS_SET_DIRECT_MAP and vmalloc come to mind.

For example, if a range were mapped using a large PUD, but we had to
unplug it partially (unplugging memory that is part of boot memory),
we'd have to replace the large PUD with a PMD table first. That change
(which could affect other concurrent page table walkers/operations)
has to be synchronized.
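
A minimal sketch of when such a split would be needed, under the
assumption that the affected part of the direct map is backed by
1 GiB leaf mappings (PUD_SHIFT of 30 is an assumption here, and
unplug_needs_pud_split() is a hypothetical helper, not a kernel
function):

```c
#include <stdbool.h>
#include <stdint.h>

#define PUD_SHIFT 30	/* assumption: 1 GiB leaf mappings */
#define PUD_SIZE  (UINT64_C(1) << PUD_SHIFT)
#define PUD_MASK  (~(PUD_SIZE - 1))

/*
 * Returns true if removing [start, end) would cover only part of a
 * PUD-sized leaf, i.e. the leaf would first have to be replaced by a
 * table of smaller mappings.
 */
static bool unplug_needs_pud_split(uint64_t start, uint64_t end)
{
	return (start & ~PUD_MASK) != 0 || (end & ~PUD_MASK) != 0;
}
```

Removing a single 128M section from the middle of a 1 GiB mapping
would therefore require the split, while removing a whole, aligned
1 GiB range would not.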

I guess to which degree this applies to riscv depends on the virtual 
memory layout, direct mapping granularity and features (e.g., 
CONFIG_ARCH_HAS_SET_DIRECT_MAP).


One trick that arm64 implements is that it only allows hotunplugging
memory that was hotplugged (see prevent_bootmem_remove_notifier()). That
might just rule out the problematic cases that require locking
completely, making the high-level mem_hotplug_lock sufficient.
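
The policy can be modelled in a few lines. This is a toy userspace
sketch, not the arm64 implementation: the boot memory table, its
addresses, and may_offline() are all invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open physical range [start, end). */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical memory present at boot: 2 GiB starting at 0x80000000. */
static const struct toy_range boot_mem[] = {
	{ UINT64_C(0x80000000), UINT64_C(0x100000000) },
};

/* True if [start, end) overlaps any range that was present at boot. */
static bool overlaps_boot_memory(uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < sizeof(boot_mem) / sizeof(boot_mem[0]); i++) {
		if (start < boot_mem[i].end && boot_mem[i].start < end)
			return true;
	}
	return false;
}

/* Policy: only memory that was hotplugged may be taken offline. */
static bool may_offline(uint64_t start, uint64_t end)
{
	return !overlaps_boot_memory(start, end);
}
```

With this rule in place, an unplug request can never land in the
middle of a mapping that was set up at boot, which is the case that
would otherwise need a split.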

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/7] riscv: mm: Refactor create_linear_mapping_range() for hot add
  2023-05-12 14:57 ` [PATCH 3/7] riscv: mm: Refactor create_linear_mapping_range() for hot add Björn Töpel
@ 2023-06-21 23:56   ` Palmer Dabbelt
  2023-06-22  4:56     ` Björn Töpel
  0 siblings, 1 reply; 14+ messages in thread
From: Palmer Dabbelt @ 2023-06-21 23:56 UTC (permalink / raw)
  To: bjorn
  Cc: Paul Walmsley, aou, linux-riscv, Bjorn Topel, linux-kernel,
	linux-mm, david, osalvador, virtualization, linux, alexghiti

On Fri, 12 May 2023 07:57:33 PDT (-0700), bjorn@kernel.org wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
>
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
>
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>  arch/riscv/mm/init.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index e974ff6ef036..aea8ccb3f4ae 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1247,18 +1247,19 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  	pt_ops_set_fixmap();
>  }
>
> -static void __init create_linear_mapping_range(phys_addr_t start,
> -					       phys_addr_t end)
> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> +						  struct mhp_params *params)

Sorry if I missed a v2, but it looks like this fails to build under 
CONFIG_MEMORY_HOTPLUG=n (as struct mhp_params isn't defined) -- unless I 
screwed up some merge conflict, but doesn't look like it here.

I'm getting

      CC      arch/riscv/mm/init.o
    arch/riscv/mm/init.c:1252:58: warning: ‘struct mhp_params’ declared inside parameter list will not be visible outside of this definition or declaration
     1252 |                                                   struct mhp_params *params)
          |                                                          ^~~~~~~~~~
    arch/riscv/mm/init.c: In function ‘create_linear_mapping_range’:
    arch/riscv/mm/init.c:1261:42: error: invalid use of undefined type ‘struct mhp_params’
     1261 |                 pgprot =  params ? params->pgprot : pgprot_from_va(va);
          |                                          ^~
    make[3]: *** [scripts/Makefile.build:252: arch/riscv/mm/init.o] Error 1
    make[2]: *** [scripts/Makefile.build:494: arch/riscv/mm] Error 2
    make[1]: *** [scripts/Makefile.build:494: arch/riscv] Error 2
    make: *** [Makefile:2026: .] Error 2

patchwork is saying something similar
<https://gist.github.com/conor-pwbot/9ed9a564e63d824aed1786050ee06558>.
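
One possible way around this, sketched here as standalone C rather
than as a proposed patch, is to have the shared function take only
the protection bits, so it never needs the definition of
struct mhp_params. The pgprot_t typedef and pgprot_from_va() below
are simplified stand-ins for the kernel ones, and pick_pgprot() is a
hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's pgprot_t. */
typedef struct {
	unsigned long pgprot;
} pgprot_t;

/* Stand-in for pgprot_from_va(): returns an arbitrary default. */
static pgprot_t pgprot_from_va(uintptr_t va)
{
	pgprot_t prot = { (va & 1) ? 0x5UL : 0x3UL };

	return prot;
}

/*
 * Use the caller-supplied protection if there is one, otherwise fall
 * back to the address-derived default. Taking a pgprot_t pointer
 * means this code compiles regardless of whether the hotplug-only
 * struct is defined.
 */
static pgprot_t pick_pgprot(uintptr_t va, const pgprot_t *override)
{
	return override ? *override : pgprot_from_va(va);
}
```

Boot-time callers would pass NULL, and a hot-add path would pass a
pointer to the pgprot member of its parameters.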

>  {
>  	phys_addr_t pa;
>  	uintptr_t va, map_size;
>
>  	for (pa = start; pa < end; pa += map_size) {
> +		pgprot_t pgprot;
> +
>  		va = (uintptr_t)__va(pa);
> +		pgprot =  params ? params->pgprot : pgprot_from_va(va);
>  		map_size = best_map_size(pa, end - pa);
> -
> -		create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -				   pgprot_from_va(va));
> +		create_pgd_mapping(swapper_pg_dir, va, pa, map_size, pgprot);
>  	}
>  }
>
> @@ -1288,13 +1289,12 @@ static void __init create_linear_mapping_page_table(void)
>  		if (end >= __pa(PAGE_OFFSET) + memory_limit)
>  			end = __pa(PAGE_OFFSET) + memory_limit;
>
> -		create_linear_mapping_range(start, end);
> +		create_linear_mapping_range(start, end, NULL);
>  	}
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -	create_linear_mapping_range(ktext_start, ktext_start + ktext_size);
> -	create_linear_mapping_range(krodata_start,
> -				    krodata_start + krodata_size);
> +	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, NULL);
> +	create_linear_mapping_range(krodata_start, krodata_start + krodata_size, NULL);
>
>  	memblock_clear_nomap(ktext_start,  ktext_size);
>  	memblock_clear_nomap(krodata_start, krodata_size);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/7] riscv: mm: Refactor create_linear_mapping_range() for hot add
  2023-06-21 23:56   ` Palmer Dabbelt
@ 2023-06-22  4:56     ` Björn Töpel
  0 siblings, 0 replies; 14+ messages in thread
From: Björn Töpel @ 2023-06-22  4:56 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Paul Walmsley, aou, linux-riscv, Bjorn Topel, linux-kernel,
	linux-mm, david, osalvador, virtualization, linux, alexghiti

Palmer Dabbelt <palmer@dabbelt.com> writes:

> On Fri, 12 May 2023 07:57:33 PDT (-0700), bjorn@kernel.org wrote:
>> From: Björn Töpel <bjorn@rivosinc.com>
>>
>> Add a parameter to the direct map setup function, so it can be used in
>> arch_add_memory() later.
>>
>> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
>> ---
>>  arch/riscv/mm/init.c | 18 +++++++++---------
>>  1 file changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index e974ff6ef036..aea8ccb3f4ae 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -1247,18 +1247,19 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>  	pt_ops_set_fixmap();
>>  }
>>
>> -static void __init create_linear_mapping_range(phys_addr_t start,
>> -					       phys_addr_t end)
>> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
>> +						  struct mhp_params *params)
>
> Sorry if I missed a v2, but it looks like this fails to build under 
> CONFIG_MEMORY_HOTPLUG=n (as struct mhp_params isn't defined) -- unless I 
> screwed up some merge conflict, but doesn't look like it here.
>
> I'm getting
>
>       CC      arch/riscv/mm/init.o
>     arch/riscv/mm/init.c:1252:58: warning: ‘struct mhp_params’ declared inside parameter list will not be visible outside of this definition or declaration
>      1252 |                                                   struct mhp_params *params)
>           |                                                          ^~~~~~~~~~
>     arch/riscv/mm/init.c: In function ‘create_linear_mapping_range’:
>     arch/riscv/mm/init.c:1261:42: error: invalid use of undefined type ‘struct mhp_params’
>      1261 |                 pgprot =  params ? params->pgprot : pgprot_from_va(va);
>           |                                          ^~
>     make[3]: *** [scripts/Makefile.build:252: arch/riscv/mm/init.o] Error 1
>     make[2]: *** [scripts/Makefile.build:494: arch/riscv/mm] Error 2
>     make[1]: *** [scripts/Makefile.build:494: arch/riscv] Error 2
>     make: *** [Makefile:2026: .] Error 2
>
> patchwork is saying something similar
> <https://gist.github.com/conor-pwbot/9ed9a564e63d824aed1786050ee06558>.

Yup! Thanks for pointing that out. This series has a bunch more
issues that need to be resolved in a v2.


Björn

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-06-22  4:56 UTC | newest]

Thread overview: 14+ messages
2023-05-12 14:57 [PATCH 0/7] riscv: Memory Hot(Un)Plug support Björn Töpel
2023-05-12 14:57 ` [PATCH 1/7] riscv: mm: Pre-allocate PGD leaves to avoid synchronization Björn Töpel
2023-05-12 14:57 ` [PATCH 2/7] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
2023-05-12 14:57 ` [PATCH 3/7] riscv: mm: Refactor create_linear_mapping_range() for hot add Björn Töpel
2023-06-21 23:56   ` Palmer Dabbelt
2023-06-22  4:56     ` Björn Töpel
2023-05-12 14:57 ` [PATCH 4/7] riscv: mm: Add memory hot add/remove support Björn Töpel
2023-05-12 14:57 ` [PATCH 5/7] riscv: Enable memory hot add/remove arch kbuild support Björn Töpel
2023-05-12 14:57 ` [PATCH 6/7] virtio-mem: Enable virtio-mem for RISC-V Björn Töpel
2023-05-12 14:57 ` [PATCH 7/7] riscv: mm: Pre-allocate vmalloc PGD leaves Björn Töpel
2023-05-17 13:49 ` [PATCH 0/7] riscv: Memory Hot(Un)Plug support David Hildenbrand
2023-05-17 18:53   ` Björn Töpel
2023-05-21  9:15     ` Björn Töpel
2023-05-22  8:21       ` David Hildenbrand
