* [PATCH v4 0/6] Add support for memmap on memory feature on ppc64
@ 2023-07-18  2:44 ` Aneesh Kumar K.V
  0 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18  2:44 UTC (permalink / raw)
  To: linux-mm, akpm, mpe, linuxppc-dev, npiggin, christophe.leroy
  Cc: Oscar Salvador, David Hildenbrand, Michal Hocko, Vishal Verma,
	Aneesh Kumar K.V

This patch series updates the memmap on memory feature to fall back to
memmap allocation outside the memory block if the alignment rules are
not met. This makes the feature more useful on architectures like
ppc64, where the alignment rules differ with a 64K page size.

This patch series depends on the dax vmemmap optimization series
posted here:
https://lore.kernel.org/linux-mm/20230718022934.90447-1-aneesh.kumar@linux.ibm.com/

Changes from v3:
* Extend the memmap_on_memory module parameter with a "force" option that
  forces the allocation even though we can waste hotplug memory.

Changes from v2:
* Rebase to latest Linus tree
* Redo the series based on review feedback. Multiple changes to the patchset.

Changes from v1:
* Update the memory block to store vmem_altmap details. This is required
  so that when we remove the memory we can find the altmap details, which
  are needed on some architectures.
* Rebase to latest Linus tree

Aneesh Kumar K.V (6):
  mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
  mm/hotplug: Allow memmap on memory hotplug request to fallback
  mm/hotplug: Allow architecture to override memmap on memory support
    check
  mm/hotplug: Allow pageblock alignment via altmap reservation
  powerpc/book3s64/memhotplug: Enable memmap on memory for radix
  mm/hotplug: Embed vmem_altmap details in memory block

 arch/arm64/Kconfig                            |   4 +-
 arch/powerpc/Kconfig                          |   1 +
 arch/powerpc/include/asm/pgtable.h            |  24 +++
 .../platforms/pseries/hotplug-memory.c        |   3 +-
 arch/x86/Kconfig                              |   4 +-
 drivers/acpi/acpi_memhotplug.c                |   3 +-
 drivers/base/memory.c                         |  32 +++-
 include/linux/memory.h                        |   8 +-
 include/linux/memory_hotplug.h                |   3 +-
 mm/Kconfig                                    |   3 +
 mm/memory_hotplug.c                           | 168 ++++++++++++++----
 11 files changed, 193 insertions(+), 60 deletions(-)

-- 
2.41.0



* [PATCH v4 1/6] mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
  2023-07-18  2:44 ` Aneesh Kumar K.V
@ 2023-07-18  2:44   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18  2:44 UTC (permalink / raw)
  To: linux-mm, akpm, mpe, linuxppc-dev, npiggin, christophe.leroy
  Cc: Oscar Salvador, David Hildenbrand, Michal Hocko, Vishal Verma,
	Aneesh Kumar K.V

Instead of adding the config entry to each supported architecture, add an
mm/Kconfig variable and select it from the supported architectures.

No functional change in this patch.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/arm64/Kconfig | 4 +---
 arch/x86/Kconfig   | 4 +---
 mm/Kconfig         | 3 +++
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f6..20245bd72b8f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -78,6 +78,7 @@ config ARM64
 	select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION
 	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION
 	select ARCH_KEEP_MEMBLOCK
+	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_USE_GNU_PROPERTY
 	select ARCH_USE_MEMTEST
@@ -348,9 +349,6 @@ config GENERIC_CSUM
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
-config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
-	def_bool y
-
 config SMP
 	def_bool y
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 78224aa76409..d0258e92a8af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -102,6 +102,7 @@ config X86
 	select ARCH_HAS_DEBUG_WX
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
@@ -2610,9 +2611,6 @@ config ARCH_HAS_ADD_PAGES
 	def_bool y
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
 
-config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
-	def_bool y
-
 menu "Power management and ACPI options"
 
 config ARCH_HIBERNATION_HEADER
diff --git a/mm/Kconfig b/mm/Kconfig
index 923bd35f81f2..932349271e28 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -572,6 +572,9 @@ config MHP_MEMMAP_ON_MEMORY
 
 endif # MEMORY_HOTPLUG
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+       bool
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
-- 
2.41.0



* [PATCH v4 2/6] mm/hotplug: Allow memmap on memory hotplug request to fallback
  2023-07-18  2:44 ` Aneesh Kumar K.V
@ 2023-07-18  2:44   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18  2:44 UTC (permalink / raw)
  To: linux-mm, akpm, mpe, linuxppc-dev, npiggin, christophe.leroy
  Cc: Oscar Salvador, David Hildenbrand, Michal Hocko, Vishal Verma,
	Aneesh Kumar K.V

If not supported, fall back to not using memmap on memory. This avoids
the need for callers to handle the fallback themselves.
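
To illustrate, a minimal sketch of the resulting split of responsibilities,
condensed from the acpi_memhotplug.c and add_memory_resource() hunks below
(error handling omitted):

  /* caller: request memmap on memory unconditionally, it is only a hint */
  mhp_flags |= MHP_MEMMAP_ON_MEMORY;
  result = __add_memory(mgid, info->start_addr, info->length, mhp_flags);

  /* core: honour the hint only when the checks pass */
  if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
          if (mhp_supports_memmap_on_memory(size)) {
                  mhp_altmap.free = PHYS_PFN(size);
                  mhp_altmap.base_pfn = PHYS_PFN(start);
                  params.altmap = &mhp_altmap;
          }
          /* fallback to not using altmap */
  }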

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/acpi/acpi_memhotplug.c |  3 +--
 include/linux/memory_hotplug.h |  3 ++-
 mm/memory_hotplug.c            | 13 ++++++-------
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 24f662d8bd39..d0c1a71007d0 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (!info->length)
 			continue;
 
-		if (mhp_supports_memmap_on_memory(info->length))
-			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 		result = __add_memory(mgid, info->start_addr, info->length,
 				      mhp_flags);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 013c69753c91..7d2076583494 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -97,6 +97,8 @@ typedef int __bitwise mhp_t;
  * To do so, we will use the beginning of the hot-added range to build
  * the page tables for the memmap array that describes the entire range.
  * Only selected architectures support it with SPARSE_VMEMMAP.
+ * This is only a hint, the core kernel can decide to not do this based on
+ * different alignment checks.
  */
 #define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
 /*
@@ -354,7 +356,6 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid,
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
-extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3f231cf1b410..1b19462f4e72 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1247,7 +1247,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
-bool mhp_supports_memmap_on_memory(unsigned long size)
+static bool mhp_supports_memmap_on_memory(unsigned long size)
 {
 	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
 	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
@@ -1339,13 +1339,12 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	 * Self hosted memmap array
 	 */
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-		if (!mhp_supports_memmap_on_memory(size)) {
-			ret = -EINVAL;
-			goto error;
+		if (mhp_supports_memmap_on_memory(size)) {
+			mhp_altmap.free = PHYS_PFN(size);
+			mhp_altmap.base_pfn = PHYS_PFN(start);
+			params.altmap = &mhp_altmap;
 		}
-		mhp_altmap.free = PHYS_PFN(size);
-		mhp_altmap.base_pfn = PHYS_PFN(start);
-		params.altmap = &mhp_altmap;
+		/* fallback to not using altmap  */
 	}
 
 	/* call arch's memory hotadd */
-- 
2.41.0



* [PATCH v4 3/6] mm/hotplug: Allow architecture to override memmap on memory support check
  2023-07-18  2:44 ` Aneesh Kumar K.V
@ 2023-07-18  2:44   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18  2:44 UTC (permalink / raw)
  To: linux-mm, akpm, mpe, linuxppc-dev, npiggin, christophe.leroy
  Cc: Oscar Salvador, David Hildenbrand, Michal Hocko, Vishal Verma,
	Aneesh Kumar K.V

Some architectures would want different restrictions. Hence add an
architecture-specific override.

The PMD_SIZE check is moved into the override; the pageblock alignment
check stays in the generic code.
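
For reference, a sketch of what an architecture-specific override looks like
in <asm/pgtable.h>; the restriction shown is only an example, patch 5 adds the
real ppc64 version:

  #define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
  static inline bool arch_supports_memmap_on_memory(unsigned long size)
  {
          unsigned long vmemmap_size = (size >> PAGE_SHIFT) * sizeof(struct page);

          /* example restriction: only require a page-aligned vmemmap */
          return IS_ALIGNED(vmemmap_size, PAGE_SIZE);
  }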

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/memory_hotplug.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1b19462f4e72..5921c81fcb70 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1247,9 +1247,25 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+#ifndef arch_supports_memmap_on_memory
+static inline bool arch_supports_memmap_on_memory(unsigned long size)
+{
+	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
+	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
+
+	/*
+	 * As default, we want the vmemmap to span a complete PMD such that we
+	 * can map the vmemmap using a single PMD if supported by the
+	 * architecture.
+	 */
+	return IS_ALIGNED(vmemmap_size, PMD_SIZE);
+}
+#endif
+
 static bool mhp_supports_memmap_on_memory(unsigned long size)
 {
-	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
+
+	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
 	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
 	unsigned long remaining_size = size - vmemmap_size;
 
@@ -1281,8 +1297,8 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 	 */
 	return mhp_memmap_on_memory() &&
 	       size == memory_block_size_bytes() &&
-	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
-	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)) &&
+	       arch_supports_memmap_on_memory(size);
 }
 
 /*
-- 
2.41.0



* [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
  2023-07-18  2:44 ` Aneesh Kumar K.V
@ 2023-07-18  2:44   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18  2:44 UTC (permalink / raw)
  To: linux-mm, akpm, mpe, linuxppc-dev, npiggin, christophe.leroy
  Cc: Oscar Salvador, David Hildenbrand, Michal Hocko, Vishal Verma,
	Aneesh Kumar K.V

Add a new "force" mode to the memmap_on_memory module parameter that allows
pageblock alignment by reserving pages in the vmemmap altmap area. This
implies we will be reserving some pages for every memory block. It also
makes the memmap on memory feature usable with a wider range of memory
block sizes.
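
For reference, the accounting done in force mode, condensed from
memory_block_align_base() and add_memory_resource() below (illustrative only):

  /* pages needed to map the block's own memmap, rounded up to full pages */
  vmemmap_pages = DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);
  /* padding up to the next pageblock boundary; this is the wasted part */
  mhp_altmap.reserve = pageblock_align(vmemmap_pages) - vmemmap_pages;
  /* pages the altmap can still hand out for the vmemmap itself */
  mhp_altmap.free = PHYS_PFN(size) - mhp_altmap.reserve;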

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/memory_hotplug.c | 109 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 96 insertions(+), 13 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5921c81fcb70..c409f5ff6a59 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -41,17 +41,85 @@
 #include "internal.h"
 #include "shuffle.h"
 
+enum {
+	MEMMAP_ON_MEMORY_DISABLE = 0,
+	MEMMAP_ON_MEMORY_ENABLE,
+	MEMMAP_ON_MEMORY_FORCE,
+};
+
+static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE;
+
+static inline unsigned long memory_block_align_base(unsigned long size)
+{
+	if (memmap_mode == MEMMAP_ON_MEMORY_FORCE) {
+		unsigned long align;
+		unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
+		unsigned long vmemmap_size;
+
+		vmemmap_size = DIV_ROUND_UP(nr_vmemmap_pages * sizeof(struct page), PAGE_SIZE);
+		align = pageblock_align(vmemmap_size) - vmemmap_size;
+		return align;
+	} else
+		return 0;
+}
+
 #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
 /*
  * memory_hotplug.memmap_on_memory parameter
  */
-static bool memmap_on_memory __ro_after_init;
-module_param(memmap_on_memory, bool, 0444);
-MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
+static int set_memmap_mode(const char *val, const struct kernel_param *kp)
+{
+	int ret, mode;
+	bool enabled;
+
+	if (sysfs_streq(val, "force") ||  sysfs_streq(val, "FORCE")) {
+		mode =  MEMMAP_ON_MEMORY_FORCE;
+		goto matched;
+	}
+
+	ret = kstrtobool(val, &enabled);
+	if (ret < 0)
+		return ret;
+	if (enabled)
+		mode =  MEMMAP_ON_MEMORY_ENABLE;
+	else
+		mode =  MEMMAP_ON_MEMORY_DISABLE;
+
+matched:
+	*((int *)kp->arg) =  mode;
+	if (mode == MEMMAP_ON_MEMORY_FORCE) {
+		pr_info("Memory hotplug will reserve %ld pages in each memory block\n",
+			memory_block_align_base(memory_block_size_bytes()));
+	}
+	return 0;
+}
+
+static int get_memmap_mode(char *buffer, const struct kernel_param *kp)
+{
+	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE)
+		return sprintf(buffer,  "force\n");
+	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_ENABLE)
+		return sprintf(buffer,  "y\n");
+
+	return sprintf(buffer,  "n\n");
+}
+
+static const struct kernel_param_ops memmap_mode_ops = {
+	.set = set_memmap_mode,
+	.get = get_memmap_mode,
+};
+module_param_cb(memmap_on_memory, &memmap_mode_ops, &memmap_mode, 0444);
+MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug\n"
+	"With value \"force\" it could result in memory wastage due to memmap size limitations \n"
+	"For example, if the memmap for a memory block requires 1 MiB, but the pageblock \n"
+	"size is 2 MiB, 1 MiB of hotplugged memory will be wasted. Note that there are \n"
+	"still cases where the feature cannot be enforced: for example, if the memmap is \n"
+	"smaller than a single page, or if the architecture does not support the forced \n"
+	"mode in all configurations. (y/n/force)");
 
 static inline bool mhp_memmap_on_memory(void)
 {
-	return memmap_on_memory;
+	return !!memmap_mode;
 }
 #else
 static inline bool mhp_memmap_on_memory(void)
@@ -1264,7 +1332,6 @@ static inline bool arch_supports_memmap_on_memory(unsigned long size)
 
 static bool mhp_supports_memmap_on_memory(unsigned long size)
 {
-
 	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
 	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
 	unsigned long remaining_size = size - vmemmap_size;
@@ -1295,10 +1362,23 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 	 *       altmap as an alternative source of memory, and we do not exactly
 	 *       populate a single PMD.
 	 */
-	return mhp_memmap_on_memory() &&
-	       size == memory_block_size_bytes() &&
-	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)) &&
-	       arch_supports_memmap_on_memory(size);
+	if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
+		return false;
+
+	/*
+	 * Make sure the vmemmap allocation is fully contained
+	 * so that we always allocate vmemmap memory from altmap area.
+	 */
+	if (!IS_ALIGNED(vmemmap_size,  PAGE_SIZE))
+		return false;
+	 /*
+	  * Without page reservation remaining pages should be pageblock aligned.
+	  */
+	if (memmap_mode != MEMMAP_ON_MEMORY_FORCE &&
+	    !IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)))
+		return false;
+
+	return arch_supports_memmap_on_memory(size);
 }
 
 /*
@@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap mhp_altmap = {
+		.base_pfn =  PHYS_PFN(res->start),
+		.end_pfn  =  PHYS_PFN(res->end),
+		.reserve  = memory_block_align_base(resource_size(res)),
+	};
 	struct memory_group *group = NULL;
 	u64 start, size;
 	bool new_node = false;
@@ -1356,8 +1440,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	 */
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
 		if (mhp_supports_memmap_on_memory(size)) {
-			mhp_altmap.free = PHYS_PFN(size);
-			mhp_altmap.base_pfn = PHYS_PFN(start);
+			mhp_altmap.free = PHYS_PFN(size) - mhp_altmap.reserve;
 			params.altmap = &mhp_altmap;
 		}
 		/* fallback to not using altmap  */
@@ -1369,7 +1452,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc + mhp_altmap.reserve,
 					  group);
 	if (ret) {
 		arch_remove_memory(start, size, NULL);
-- 
2.41.0



* [PATCH v4 5/6] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
  2023-07-18  2:44 ` Aneesh Kumar K.V
@ 2023-07-18  2:44   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18  2:44 UTC (permalink / raw)
  To: linux-mm, akpm, mpe, linuxppc-dev, npiggin, christophe.leroy
  Cc: Oscar Salvador, David Hildenbrand, Michal Hocko, Vishal Verma,
	Aneesh Kumar K.V

Radix vmemmap mapping can map things correctly at the PMD level or PTE
level based on different device boundary checks. Hence we skip the
restriction that the vmemmap size be a multiple of PMD_SIZE. This also
makes the feature more widely useful, because using a PMD_SIZE vmemmap
area would require a memory block size of 2GiB.

With the altmap.reserve mechanism (memmap_on_memory=force), the feature can
also work with a memory block size of 256MiB, reserving enough pages to keep
things aligned at pageblock granularity. We can end up losing some pages in
memory with this. For example, with a 256MiB memory block size we require
4 pages to map the vmemmap pages; in order to align things correctly we end
up adding a reserve of 28 pages, i.e. for every 4096 pages 28 pages get
reserved.
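
The numbers above can be reproduced with a small stand-alone calculation; the
64-byte struct page and 32-page pageblock below are assumptions chosen to
match the 28-page reserve quoted above:

  #include <stdio.h>

  int main(void)
  {
          unsigned long page_size = 64 * 1024;            /* assumed 64K pages */
          unsigned long struct_page_size = 64;            /* assumed sizeof(struct page) */
          unsigned long pageblock_nr_pages = 32;          /* assumed pageblock size */
          unsigned long block_size = 256UL * 1024 * 1024; /* 256MiB memory block */

          unsigned long nr_pages = block_size / page_size;                        /* 4096 */
          unsigned long vmemmap_pages =
                  (nr_pages * struct_page_size + page_size - 1) / page_size;      /* 4 */
          unsigned long aligned = (vmemmap_pages + pageblock_nr_pages - 1) /
                                  pageblock_nr_pages * pageblock_nr_pages;        /* 32 */

          printf("reserve %lu pages per %lu page block\n",
                 aligned - vmemmap_pages, nr_pages);                              /* 28 per 4096 */
          return 0;
  }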

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/Kconfig                          |  1 +
 arch/powerpc/include/asm/pgtable.h            | 24 +++++++++++++++++++
 .../platforms/pseries/hotplug-memory.c        |  3 ++-
 mm/memory_hotplug.c                           |  2 ++
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 116d6add0bb0..f890907e5bbf 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -157,6 +157,7 @@ config PPC
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_KEEP_MEMBLOCK
+	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE	if PPC_RADIX_MMU
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select ARCH_OPTIONAL_KERNEL_RWX		if ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 68817ea7f994..3d35371395a9 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -169,6 +169,30 @@ static inline bool is_ioremap_addr(const void *x)
 int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size);
 bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
 			   unsigned long page_size);
+/*
+ * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
+ * some of the restrictions. We don't check for PMD_SIZE because our
+ * vmemmap allocation code can fallback correctly. The pageblock
+ * alignment requirement is met using altmap->reserve blocks.
+ */
+#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
+static inline bool arch_supports_memmap_on_memory(unsigned long size)
+{
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long vmemmap_size = nr_pages * sizeof(struct page);
+
+	if (!radix_enabled())
+		return false;
+
+	if (IS_ENABLED(CONFIG_PPC_4K_PAGES))
+		return IS_ALIGNED(vmemmap_size, PMD_SIZE);
+	/*
+	 * The pageblock alignment requirement is met by using
+	 * reserve blocks in altmap.
+	 */
+	return true;
+}
+
 #endif /* CONFIG_PPC64 */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 9c62c2c3b3d0..1447509357a7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -617,6 +617,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
 
 static int dlpar_add_lmb(struct drmem_lmb *lmb)
 {
+	mhp_t mhp_flags = MHP_NONE | MHP_MEMMAP_ON_MEMORY;
 	unsigned long block_sz;
 	int nid, rc;
 
@@ -637,7 +638,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 		nid = first_online_node;
 
 	/* Add the memory */
-	rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
+	rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
 		return rc;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c409f5ff6a59..6da063c80733 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2174,6 +2174,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
 			 * right thing if we used vmem_altmap when hot-adding
 			 * the range.
 			 */
+			mhp_altmap.base_pfn = PHYS_PFN(start);
+			mhp_altmap.free = PHYS_PFN(size) - nr_vmemmap_pages;
 			mhp_altmap.alloc = nr_vmemmap_pages;
 			altmap = &mhp_altmap;
 		}
-- 
2.41.0



* [PATCH v4 6/6] mm/hotplug: Embed vmem_altmap details in memory block
  2023-07-18  2:44 ` Aneesh Kumar K.V
@ 2023-07-18  2:44   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18  2:44 UTC (permalink / raw)
  To: linux-mm, akpm, mpe, linuxppc-dev, npiggin, christophe.leroy
  Cc: Oscar Salvador, David Hildenbrand, Michal Hocko, Vishal Verma,
	Aneesh Kumar K.V

With memmap on memory, some architectures need more details w.r.t. the
altmap, such as base_pfn, end_pfn, etc., to unmap the vmemmap memory.
Instead of computing them again when we remove a memory block, embed the
vmem_altmap details in struct memory_block if we are using the memmap on
memory feature.

No functional change in this patch.
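
A condensed view of the removal path this enables, taken from the
try_remove_memory() hunk below: the altmap recorded at hot-add time is simply
read back from the memory block instead of being recomputed:

  if (mhp_memmap_on_memory()) {
          /* get_vmemmap_altmap_cb() copies mem->altmap and stops the walk */
          if (walk_memory_blocks(start, size, &mhp_altmap,
                                 get_vmemmap_altmap_cb))
                  altmap = &mhp_altmap;
  }

  arch_remove_memory(start, size, altmap);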

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/base/memory.c  | 32 +++++++++++++++++++++++---------
 include/linux/memory.h |  8 ++------
 mm/memory_hotplug.c    | 38 ++++++++++++++++++--------------------
 3 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b456ac213610..cef6506f0209 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -106,6 +106,7 @@ static void memory_block_release(struct device *dev)
 {
 	struct memory_block *mem = to_memory_block(dev);
 
+	kfree(mem->altmap);
 	kfree(mem);
 }
 
@@ -183,7 +184,7 @@ static int memory_block_online(struct memory_block *mem)
 {
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
-	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+	unsigned long nr_vmemmap_pages = 0;
 	struct zone *zone;
 	int ret;
 
@@ -200,6 +201,9 @@ static int memory_block_online(struct memory_block *mem)
 	 * stage helps to keep accounting easier to follow - e.g vmemmaps
 	 * belong to the same zone as the memory they backed.
 	 */
+	if (mem->altmap)
+		nr_vmemmap_pages = mem->altmap->alloc + mem->altmap->reserve;
+
 	if (nr_vmemmap_pages) {
 		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
 		if (ret)
@@ -230,7 +234,7 @@ static int memory_block_offline(struct memory_block *mem)
 {
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
-	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+	unsigned long nr_vmemmap_pages = 0;
 	int ret;
 
 	if (!mem->zone)
@@ -240,6 +244,9 @@ static int memory_block_offline(struct memory_block *mem)
 	 * Unaccount before offlining, such that unpopulated zone and kthreads
 	 * can properly be torn down in offline_pages().
 	 */
+	if (mem->altmap)
+		nr_vmemmap_pages = mem->altmap->alloc + mem->altmap->reserve;
+
 	if (nr_vmemmap_pages)
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  -nr_vmemmap_pages);
@@ -726,7 +733,7 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
 #endif
 
 static int add_memory_block(unsigned long block_id, unsigned long state,
-			    unsigned long nr_vmemmap_pages,
+			    struct vmem_altmap *altmap,
 			    struct memory_group *group)
 {
 	struct memory_block *mem;
@@ -744,7 +751,14 @@ static int add_memory_block(unsigned long block_id, unsigned long state,
 	mem->start_section_nr = block_id * sections_per_block;
 	mem->state = state;
 	mem->nid = NUMA_NO_NODE;
-	mem->nr_vmemmap_pages = nr_vmemmap_pages;
+	if (altmap) {
+		mem->altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
+		if (!mem->altmap) {
+			kfree(mem);
+			return -ENOMEM;
+		}
+		memcpy(mem->altmap, altmap, sizeof(*altmap));
+	}
 	INIT_LIST_HEAD(&mem->group_next);
 
 #ifndef CONFIG_NUMA
@@ -783,14 +797,14 @@ static int __init add_boot_memory_block(unsigned long base_section_nr)
 	if (section_count == 0)
 		return 0;
 	return add_memory_block(memory_block_id(base_section_nr),
-				MEM_ONLINE, 0,  NULL);
+				MEM_ONLINE, NULL,  NULL);
 }
 
 static int add_hotplug_memory_block(unsigned long block_id,
-				    unsigned long nr_vmemmap_pages,
+				    struct vmem_altmap *altmap,
 				    struct memory_group *group)
 {
-	return add_memory_block(block_id, MEM_OFFLINE, nr_vmemmap_pages, group);
+	return add_memory_block(block_id, MEM_OFFLINE, altmap, group);
 }
 
 static void remove_memory_block(struct memory_block *memory)
@@ -818,7 +832,7 @@ static void remove_memory_block(struct memory_block *memory)
  * Called under device_hotplug_lock.
  */
 int create_memory_block_devices(unsigned long start, unsigned long size,
-				unsigned long vmemmap_pages,
+				struct vmem_altmap *altmap,
 				struct memory_group *group)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
@@ -832,7 +846,7 @@ int create_memory_block_devices(unsigned long start, unsigned long size,
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = add_hotplug_memory_block(block_id, vmemmap_pages, group);
+		ret = add_hotplug_memory_block(block_id, altmap, group);
 		if (ret)
 			break;
 	}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 31343566c221..f53cfdaaaa41 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -77,11 +77,7 @@ struct memory_block {
 	 */
 	struct zone *zone;
 	struct device dev;
-	/*
-	 * Number of vmemmap pages. These pages
-	 * lay at the beginning of the memory block.
-	 */
-	unsigned long nr_vmemmap_pages;
+	struct vmem_altmap *altmap;
 	struct memory_group *group;	/* group (if any) for this block */
 	struct list_head group_next;	/* next block inside memory group */
 #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG)
@@ -147,7 +143,7 @@ static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri)
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 int create_memory_block_devices(unsigned long start, unsigned long size,
-				unsigned long vmemmap_pages,
+				struct vmem_altmap *altmap,
 				struct memory_group *group);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6da063c80733..6a8adbe030f9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1452,8 +1452,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, mhp_altmap.alloc + mhp_altmap.reserve,
-					  group);
+	ret = create_memory_block_devices(start, size, params.altmap, group);
 	if (ret) {
 		arch_remove_memory(start, size, NULL);
 		goto error;
@@ -2054,12 +2053,18 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return 0;
 }
 
-static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+static int get_vmemmap_altmap_cb(struct memory_block *mem, void *arg)
 {
+	struct vmem_altmap *altmap = (struct vmem_altmap *)arg;
 	/*
-	 * If not set, continue with the next block.
+	 * If we have any pages allocated from altmap
+	 * return the altmap details and break callback.
 	 */
-	return mem->nr_vmemmap_pages;
+	if (mem->altmap) {
+		memcpy(altmap, mem->altmap, sizeof(struct vmem_altmap));
+		return 1;
+	}
+	return 0;
 }
 
 static int check_cpu_on_node(int nid)
@@ -2134,9 +2139,8 @@ EXPORT_SYMBOL(try_offline_node);
 
 static int __ref try_remove_memory(u64 start, u64 size)
 {
-	struct vmem_altmap mhp_altmap = {};
-	struct vmem_altmap *altmap = NULL;
-	unsigned long nr_vmemmap_pages;
+	int ret;
+	struct vmem_altmap mhp_altmap, *altmap = NULL;
 	int rc = 0, nid = NUMA_NO_NODE;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
@@ -2159,24 +2163,15 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	 * the same granularity it was added - a single memory block.
 	 */
 	if (mhp_memmap_on_memory()) {
-		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
-						      get_nr_vmemmap_pages_cb);
-		if (nr_vmemmap_pages) {
+		ret = walk_memory_blocks(start, size, &mhp_altmap,
+					 get_vmemmap_altmap_cb);
+		if (ret) {
 			if (size != memory_block_size_bytes()) {
 				pr_warn("Refuse to remove %#llx - %#llx,"
 					"wrong granularity\n",
 					start, start + size);
 				return -EINVAL;
 			}
-
-			/*
-			 * Let remove_pmd_table->free_hugepage_table do the
-			 * right thing if we used vmem_altmap when hot-adding
-			 * the range.
-			 */
-			mhp_altmap.base_pfn = PHYS_PFN(start);
-			mhp_altmap.free = PHYS_PFN(size) - nr_vmemmap_pages;
-			mhp_altmap.alloc = nr_vmemmap_pages;
 			altmap = &mhp_altmap;
 		}
 	}
@@ -2194,6 +2189,9 @@ static int __ref try_remove_memory(u64 start, u64 size)
 
 	arch_remove_memory(start, size, altmap);
 
+	if (altmap)
+		WARN(altmap->alloc, "Altmap not fully unmapped");
+
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_phys_free(start, size);
 		memblock_remove(start, size);
-- 
2.41.0



+	 * If we have any pages allocated from altmap
+	 * return the altmap details and break callback.
 	 */
-	return mem->nr_vmemmap_pages;
+	if (mem->altmap) {
+		memcpy(altmap, mem->altmap, sizeof(struct vmem_altmap));
+		return 1;
+	}
+	return 0;
 }
 
 static int check_cpu_on_node(int nid)
@@ -2134,9 +2139,8 @@ EXPORT_SYMBOL(try_offline_node);
 
 static int __ref try_remove_memory(u64 start, u64 size)
 {
-	struct vmem_altmap mhp_altmap = {};
-	struct vmem_altmap *altmap = NULL;
-	unsigned long nr_vmemmap_pages;
+	int ret;
+	struct vmem_altmap mhp_altmap, *altmap = NULL;
 	int rc = 0, nid = NUMA_NO_NODE;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
@@ -2159,24 +2163,15 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	 * the same granularity it was added - a single memory block.
 	 */
 	if (mhp_memmap_on_memory()) {
-		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
-						      get_nr_vmemmap_pages_cb);
-		if (nr_vmemmap_pages) {
+		ret = walk_memory_blocks(start, size, &mhp_altmap,
+					 get_vmemmap_altmap_cb);
+		if (ret) {
 			if (size != memory_block_size_bytes()) {
 				pr_warn("Refuse to remove %#llx - %#llx,"
 					"wrong granularity\n",
 					start, start + size);
 				return -EINVAL;
 			}
-
-			/*
-			 * Let remove_pmd_table->free_hugepage_table do the
-			 * right thing if we used vmem_altmap when hot-adding
-			 * the range.
-			 */
-			mhp_altmap.base_pfn = PHYS_PFN(start);
-			mhp_altmap.free = PHYS_PFN(size) - nr_vmemmap_pages;
-			mhp_altmap.alloc = nr_vmemmap_pages;
 			altmap = &mhp_altmap;
 		}
 	}
@@ -2194,6 +2189,9 @@ static int __ref try_remove_memory(u64 start, u64 size)
 
 	arch_remove_memory(start, size, altmap);
 
+	if (altmap)
+		WARN(altmap->alloc, "Altmap not fully unmapped");
+
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_phys_free(start, size);
 		memblock_remove(start, size);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 2/6] mm/hotplug: Allow memmap on memory hotplug request to fallback
  2023-07-18  2:44   ` Aneesh Kumar K.V
@ 2023-07-24 12:29     ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 12:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 18.07.23 04:44, Aneesh Kumar K.V wrote:
> If not supported, fall back to not using memmap on memory. This avoids
> the need for callers to do the fallback.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/6] mm/hotplug: Allow architecture to override memmap on memory support check
  2023-07-18  2:44   ` Aneesh Kumar K.V
@ 2023-07-24 12:30     ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 12:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 18.07.23 04:44, Aneesh Kumar K.V wrote:
> Some architectures would want different restrictions. Hence add an
> architecture-specific override.
> 
> Both the PMD_SIZE check and pageblock alignment check are moved there.

No :)

> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   mm/memory_hotplug.c | 22 +++++++++++++++++++---
>   1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 1b19462f4e72..5921c81fcb70 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1247,9 +1247,25 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>   	return device_online(&mem->dev);
>   }
>   
> +#ifndef arch_supports_memmap_on_memory
> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
> +{
> +	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
> +	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
> +
> +	/*
> +	 * As default, we want the vmemmap to span a complete PMD such that we
> +	 * can map the vmemmap using a single PMD if supported by the
> +	 * architecture.
> +	 */
> +	return IS_ALIGNED(vmemmap_size, PMD_SIZE);
> +}
> +#endif
> +
>   static bool mhp_supports_memmap_on_memory(unsigned long size)
>   {
> -	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
> +
> +	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>   	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>   	unsigned long remaining_size = size - vmemmap_size;
>   
> @@ -1281,8 +1297,8 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>   	 */
>   	return mhp_memmap_on_memory() &&
>   	       size == memory_block_size_bytes() &&
> -	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> -	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> +	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)) &&
> +	       arch_supports_memmap_on_memory(size);
>   }
>   
>   /*

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/6] mm/hotplug: Allow architecture to override memmap on memory support check
  2023-07-18  2:44   ` Aneesh Kumar K.V
@ 2023-07-24 13:47     ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 13:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 18.07.23 04:44, Aneesh Kumar K.V wrote:
> Some architectures would want different restrictions. Hence add an
> architecture-specific override.
> 
> Both the PMD_SIZE check and pageblock alignment check are moved there.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   mm/memory_hotplug.c | 22 +++++++++++++++++++---
>   1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 1b19462f4e72..5921c81fcb70 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1247,9 +1247,25 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>   	return device_online(&mem->dev);
>   }
>   
> +#ifndef arch_supports_memmap_on_memory
> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
> +{
> +	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
> +	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
> +
> +	/*
> +	 * As default, we want the vmemmap to span a complete PMD such that we
> +	 * can map the vmemmap using a single PMD if supported by the
> +	 * architecture.
> +	 */
> +	return IS_ALIGNED(vmemmap_size, PMD_SIZE);
> +}
> +#endif
> +
>   static bool mhp_supports_memmap_on_memory(unsigned long size)
>   {
> -	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
> +

^ just spotted this empty line that gets added here and removed in the
next patch.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
  2023-07-18  2:44   ` Aneesh Kumar K.V
@ 2023-07-24 14:33     ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 14:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 18.07.23 04:44, Aneesh Kumar K.V wrote:
> Add a new kconfig option that can be selected if we want to allow

That description seems outdated.

> pageblock alignment by reserving pages in the vmemmap altmap area.
> This implies we will be reserving some pages for every memory block.
> This also allows the memmap on memory feature to be widely useful
> with different memory block size values.

Can you add some more meat to the description, and especially, in
which cases this might be desired and in which cases it might be
completely undesired?


Let's assume we hotplug a 1 GiB DIMM on arm64/64k. With 512 MiB pageblocks,
we'd waste 50% of the hotplugged memory.

Also, see below on the case where we could end up with 100% wasted memory,
which we want to block completely.
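
(For the record, rough numbers assuming a 64-byte struct page: the memmap for that
1 GiB DIMM with 64k pages is only 16384 * 64 bytes = 1 MiB, but force mode rounds that
up to a full 512 MiB pageblock, i.e. half the DIMM gets reserved.)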


Also, I wonder if we can avoid talking about "page reservation" or "altmap reservation",
that's rather an implementation detail.

For example, I'd call this patch

"mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks"



> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   mm/memory_hotplug.c | 109 ++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 96 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 5921c81fcb70..c409f5ff6a59 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -41,17 +41,85 @@
>   #include "internal.h"
>   #include "shuffle.h"
>   
> +enum {
> +	MEMMAP_ON_MEMORY_DISABLE = 0,
> +	MEMMAP_ON_MEMORY_ENABLE,
> +	MEMMAP_ON_MEMORY_FORCE,
> +};
> +
> +static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE;
> +
> +static inline unsigned long memory_block_align_base(unsigned long size)
> +{

Can we start with something like this instead?

memory_block_memmap_size() might be reasonable to put into the previous patch.


static inline unsigned long memory_block_memmap_size(void)
{
	return PHYS_PFN(memory_block_size_bytes()) * sizeof(struct page);
}

/*
  * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
  * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
  * aligned, however, we might waste memory.
  */
static inline unsigned long memory_block_memmap_on_memory_size(void)
{
	unsigned long size = memory_block_memmap_size();

	if (memmap_mode != MEMMAP_ON_MEMORY_FORCE)
		return size;
	return ALIGN(size, PFN_PHYS(pageblock_nr_pages));
}
	


> +	if (memmap_mode == MEMMAP_ON_MEMORY_FORCE) {
> +		unsigned long align;
> +		unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
> +		unsigned long vmemmap_size;
> +
> +		vmemmap_size = DIV_ROUND_UP(nr_vmemmap_pages * sizeof(struct page), PAGE_SIZE);
> +		align = pageblock_align(vmemmap_size) - vmemmap_size;
> +		return align;
> +	} else
> +		return 0;
> +}
> +
>   #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
>   /*
>    * memory_hotplug.memmap_on_memory parameter
>    */
> -static bool memmap_on_memory __ro_after_init;
> -module_param(memmap_on_memory, bool, 0444);
> -MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
> +static int set_memmap_mode(const char *val, const struct kernel_param *kp)
> +{
> +	int ret, mode;
> +	bool enabled;
> +
> +	if (sysfs_streq(val, "force") ||  sysfs_streq(val, "FORCE")) {
> +		mode =  MEMMAP_ON_MEMORY_FORCE;
> +		goto matched;
> +	}
> +
> +	ret = kstrtobool(val, &enabled);
> +	if (ret < 0)
> +		return ret;
> +	if (enabled)
> +		mode =  MEMMAP_ON_MEMORY_ENABLE;
> +	else
> +		mode =  MEMMAP_ON_MEMORY_DISABLE;
> +
> +matched:
> +	*((int *)kp->arg) =  mode;
> +	if (mode == MEMMAP_ON_MEMORY_FORCE) {
> +		pr_info("Memory hotplug will reserve %ld pages in each memory block\n",
> +			memory_block_align_base(memory_block_size_bytes()));
> +	}
> +	return 0;
> +}
> +
> +static int get_memmap_mode(char *buffer, const struct kernel_param *kp)
> +{
> +	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE)
> +		return sprintf(buffer,  "force\n");
> +	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_ENABLE)
> +		return sprintf(buffer,  "y\n");
> +
> +	return sprintf(buffer,  "n\n");

param_get_bool() uses uppercase Y / N. Maybe just return the uppercase variants here as well.
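
I.e. something like sprintf(buffer, "Y\n") / sprintf(buffer, "N\n") here, matching
what param_get_bool() prints.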

> +}
> +
> +static const struct kernel_param_ops memmap_mode_ops = {
> +	.set = set_memmap_mode,
> +	.get = get_memmap_mode,
> +};
> +module_param_cb(memmap_on_memory, &memmap_mode_ops, &memmap_mode, 0444);
> +MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug\n"
> +	"With value \"force\" it could result in memory wastage due to memmap size limitations \n"
> +	"For example, if the memmap for a memory block requires 1 MiB, but the pageblock \n"
> +	"size is 2 MiB, 1 MiB of hotplugged memory will be wasted. Note that there are \n"
> +	"still cases where the feature cannot be enforced: for example, if the memmap is \n"
> +	"smaller than a single page, or if the architecture does not support the forced \n"
> +	"mode in all configurations. (y/n/force)");

That's a bit of a mouthful. Can we simplify and put the full doc into

Documentation/admin-guide/mm/memory-hotplug.rst

?
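
Something like this, perhaps (just a sketch):

MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug (y/n/force)");

with the force-mode caveats spelled out in the admin guide instead.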

>   
>   static inline bool mhp_memmap_on_memory(void)
>   {
> -	return memmap_on_memory;
> +	return !!memmap_mode;

Maybe better  "memmap_mode != MEMMAP_ON_MEMORY_DISABLE"

>   }
>   #else
>   static inline bool mhp_memmap_on_memory(void)
> @@ -1264,7 +1332,6 @@ static inline bool arch_supports_memmap_on_memory(unsigned long size)
>   
>   static bool mhp_supports_memmap_on_memory(unsigned long size)
>   {
> -
>   	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>   	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>   	unsigned long remaining_size = size - vmemmap_size;
> @@ -1295,10 +1362,23 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>   	 *       altmap as an alternative source of memory, and we do not exactly
>   	 *       populate a single PMD.
>   	 */
> -	return mhp_memmap_on_memory() &&
> -	       size == memory_block_size_bytes() &&
> -	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)) &&
> -	       arch_supports_memmap_on_memory(size);
> +	if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
> +		return false;
> +
> +	/*
> +	 * Make sure the vmemmap allocation is fully contained
> +	 * so that we always allocate vmemmap memory from altmap area.
> +	 */
> +	if (!IS_ALIGNED(vmemmap_size,  PAGE_SIZE))
> +		return false;
> +	 /*
> +	  * Without page reservation remaining pages should be pageblock aligned.
> +	  */
> +	if (memmap_mode != MEMMAP_ON_MEMORY_FORCE &&
> +	    !IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)))
> +		return false;

With our new helper, this becomes the following:

memmap_on_memory_size = memory_block_memmap_on_memory_size();

if (!IS_ALIGNED(memmap_on_memory_size, PFN_PHYS(pageblock_nr_pages)))
	/* We're not allowed to waste any memory for the memmap. */
	return false;

if (memmap_on_memory_size == memory_block_size_bytes())
	/* No effective hotplugged memory doesn't make sense. */
	return false;	

> +	return arch_supports_memmap_on_memory(size);
>   }
>   
>   /*
> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   {
>   	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>   	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> -	struct vmem_altmap mhp_altmap = {};
> +	struct vmem_altmap mhp_altmap = {
> +		.base_pfn =  PHYS_PFN(res->start),
> +		.end_pfn  =  PHYS_PFN(res->end),
> +		.reserve  = memory_block_align_base(resource_size(res)),

Can you remind me why we have to set reserve here at all?

IOW, can't we simply set

.free = memory_block_memmap_on_memory_size();

and then pass

mhp_altmap.alloc + mhp_altmap.free

to create_memory_block_devices() instead?


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/6] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
  2023-07-18  2:44   ` Aneesh Kumar K.V
@ 2023-07-24 14:34     ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 14:34 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 18.07.23 04:44, Aneesh Kumar K.V wrote:
> Radix vmemmap mapping can map things correctly at the PMD level or PTE
> level based on different device boundary checks. Hence we skip the
> restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also
> makes the feature widely useful because to use PMD_SIZE vmemmap area we
> require a memory block size of 2GiB
> 
> We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY so that the feature
> can work with a memory block size of 256MB. Using altmap.reserve feature
> to align things correctly at pageblock granularity. We can end up
> losing some pages in memory with this. For ex: with a 256MiB memory block
> size, we require 4 pages to map vmemmap pages, In order to align things
> correctly we end up adding a reserve of 28 pages. ie, for every 4096
> pages 28 pages get reserved.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   arch/powerpc/Kconfig                          |  1 +
>   arch/powerpc/include/asm/pgtable.h            | 24 +++++++++++++++++++
>   .../platforms/pseries/hotplug-memory.c        |  3 ++-
>   mm/memory_hotplug.c                           |  2 ++
>   4 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 116d6add0bb0..f890907e5bbf 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -157,6 +157,7 @@ config PPC
>   	select ARCH_HAS_UBSAN_SANITIZE_ALL
>   	select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   	select ARCH_KEEP_MEMBLOCK
> +	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE	if PPC_RADIX_MMU
>   	select ARCH_MIGHT_HAVE_PC_PARPORT
>   	select ARCH_MIGHT_HAVE_PC_SERIO
>   	select ARCH_OPTIONAL_KERNEL_RWX		if ARCH_HAS_STRICT_KERNEL_RWX
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 68817ea7f994..3d35371395a9 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -169,6 +169,30 @@ static inline bool is_ioremap_addr(const void *x)
>   int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size);
>   bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
>   			   unsigned long page_size);
> +/*
> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
> + * some of the restrictions. We don't check for PMD_SIZE because our
> + * vmemmap allocation code can fallback correctly. The pageblock
> + * alignment requirement is met using altmap->reserve blocks.
> + */
> +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
> +{
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	unsigned long vmemmap_size = nr_pages * sizeof(struct page);
> +
> +	if (!radix_enabled())
> +		return false;
> +
> +	if (IS_ENABLED(CONFIG_PPC_4K_PAGES))
> +		return IS_ALIGNED(vmemmap_size, PMD_SIZE);

Can you add a comment why we care about that in the 4K case only?

> +	/*
> +	 * The pageblock alignment requirement is met by using
> +	 * reserve blocks in altmap.
> +	 */

Just drop that comment, that's handled by common code now.

> +	return true;
> +}
> +
>   #endif /* CONFIG_PPC64 */
>   
>   #endif /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 9c62c2c3b3d0..1447509357a7 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -617,6 +617,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
>   
>   static int dlpar_add_lmb(struct drmem_lmb *lmb)
>   {
> +	mhp_t mhp_flags = MHP_NONE | MHP_MEMMAP_ON_MEMORY;
>   	unsigned long block_sz;
>   	int nid, rc;
>   
> @@ -637,7 +638,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>   		nid = first_online_node;
>   
>   	/* Add the memory */
> -	rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
> +	rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
>   	if (rc) {
>   		invalidate_lmb_associativity_index(lmb);
>   		return rc;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c409f5ff6a59..6da063c80733 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -2174,6 +2174,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
>   			 * right thing if we used vmem_altmap when hot-adding
>   			 * the range.
>   			 */
> +			mhp_altmap.base_pfn = PHYS_PFN(start);
> +			mhp_altmap.free = PHYS_PFN(size) - nr_vmemmap_pages;


That change does not belong in this patch.

>   			mhp_altmap.alloc = nr_vmemmap_pages;
>   			altmap = &mhp_altmap;
>   		}

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/6] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
  2023-07-24 14:34     ` David Hildenbrand
@ 2023-07-24 14:46       ` Aneesh Kumar K V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K V @ 2023-07-24 14:46 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 7/24/23 8:04 PM, David Hildenbrand wrote:
> On 18.07.23 04:44, Aneesh Kumar K.V wrote:
>> Radix vmemmap mapping can map things correctly at the PMD level or PTE
>> level based on different device boundary checks. Hence we skip the
>> restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also
>> makes the feature widely useful because to use PMD_SIZE vmemmap area we
>> require a memory block size of 2GiB
>>
>> We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY so that the feature
>> can work with a memory block size of 256MB. Using altmap.reserve feature
>> to align things correctly at pageblock granularity. We can end up
>> losing some pages in memory with this. For ex: with a 256MiB memory block
>> size, we require 4 pages to map vmemmap pages, In order to align things
>> correctly we end up adding a reserve of 28 pages. ie, for every 4096
>> pages 28 pages get reserved.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   arch/powerpc/Kconfig                          |  1 +
>>   arch/powerpc/include/asm/pgtable.h            | 24 +++++++++++++++++++
>>   .../platforms/pseries/hotplug-memory.c        |  3 ++-
>>   mm/memory_hotplug.c                           |  2 ++
>>   4 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 116d6add0bb0..f890907e5bbf 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -157,6 +157,7 @@ config PPC
>>       select ARCH_HAS_UBSAN_SANITIZE_ALL
>>       select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>       select ARCH_KEEP_MEMBLOCK
>> +    select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE    if PPC_RADIX_MMU
>>       select ARCH_MIGHT_HAVE_PC_PARPORT
>>       select ARCH_MIGHT_HAVE_PC_SERIO
>>       select ARCH_OPTIONAL_KERNEL_RWX        if ARCH_HAS_STRICT_KERNEL_RWX
>> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
>> index 68817ea7f994..3d35371395a9 100644
>> --- a/arch/powerpc/include/asm/pgtable.h
>> +++ b/arch/powerpc/include/asm/pgtable.h
>> @@ -169,6 +169,30 @@ static inline bool is_ioremap_addr(const void *x)
>>   int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size);
>>   bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
>>                  unsigned long page_size);
>> +/*
>> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
>> + * some of the restrictions. We don't check for PMD_SIZE because our
>> + * vmemmap allocation code can fallback correctly. The pageblock
>> + * alignment requirement is met using altmap->reserve blocks.
>> + */
>> +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory
>> +static inline bool arch_supports_memmap_on_memory(unsigned long size)
>> +{
>> +    unsigned long nr_pages = size >> PAGE_SHIFT;
>> +    unsigned long vmemmap_size = nr_pages * sizeof(struct page);
>> +
>> +    if (!radix_enabled())
>> +        return false;
>> +
>> +    if (IS_ENABLED(CONFIG_PPC_4K_PAGES))
>> +        return IS_ALIGNED(vmemmap_size, PMD_SIZE);
> 
> Can you add a comment why we care about that in the 4K case only?


Sure. We keep the PMD_SIZE alignment for the same reason we have it on x86. With a 4K page size and 2M hugepage size
things get properly aligned and the feature stays useful even with this alignment restriction. With a 64K
page size and 2M hugepage size, the same restriction makes the feature more or less useless for most of the
memory block sizes we support. I will add that comment here.
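
For example, rough numbers assuming a 64-byte struct page: a 256MiB memory block with 4K pages
needs 65536 * 64 bytes = 4MiB of vmemmap, which is a multiple of the 2MiB PMD_SIZE; with 64K pages
the same block only needs 4096 * 64 bytes = 256KiB of vmemmap, so the PMD_SIZE check would only
pass from a 2GiB memory block size onwards.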

> 
>> +    /*
>> +     * The pageblock alignment requirement is met by using
>> +     * reserve blocks in altmap.
>> +     */
> 
> Just drop that comment, that's handled by common code now.
> 

Ok. 

>> +    return true;
>> +}
>> +
>>   #endif /* CONFIG_PPC64 */
>>     #endif /* __ASSEMBLY__ */
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index 9c62c2c3b3d0..1447509357a7 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -617,6 +617,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
>>     static int dlpar_add_lmb(struct drmem_lmb *lmb)
>>   {
>> +    mhp_t mhp_flags = MHP_NONE | MHP_MEMMAP_ON_MEMORY;
>>       unsigned long block_sz;
>>       int nid, rc;
>>   @@ -637,7 +638,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>>           nid = first_online_node;
>>         /* Add the memory */
>> -    rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
>> +    rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
>>       if (rc) {
>>           invalidate_lmb_associativity_index(lmb);
>>           return rc;
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index c409f5ff6a59..6da063c80733 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -2174,6 +2174,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>                * right thing if we used vmem_altmap when hot-adding
>>                * the range.
>>                */
>> +            mhp_altmap.base_pfn = PHYS_PFN(start);
>> +            mhp_altmap.free = PHYS_PFN(size) - nr_vmemmap_pages;
> 
> 
> That change does not belong into this patch.
> 


I kept that change with the ppc64 enablement because only the ppc64 arch checks
those values in the free path.

>>               mhp_altmap.alloc = nr_vmemmap_pages;
>>               altmap = &mhp_altmap;
>>           }
> 

-aneesh


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
  2023-07-24 14:33     ` David Hildenbrand
@ 2023-07-24 15:16       ` Aneesh Kumar K V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K V @ 2023-07-24 15:16 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 7/24/23 8:03 PM, David Hildenbrand wrote:
> On 18.07.23 04:44, Aneesh Kumar K.V wrote:
>> Add a new kconfig option that can be selected if we want to allow
> 
> That description seems outdated.
> 


Will update

>> pageblock alignment by reserving pages in the vmemmap altmap area.
>> This implies we will be reserving some pages for every memoryblock
>> This also allows the memmap on memory feature to be widely useful
>> with different memory block size values.
> 
> Can you add some more meat to the description, and especially, in
> which cases this might be desired and in which cases it might be
> completely undesired?
> 
> 
> Let's assume we hotplug a 1 GiB DIMM on arm64/64k. With 512 MiB pageblocks,
> we'd waste 50% of the hotplugged memory.
> 
> Also, see below on the case where we could end up with 100% wasted memory,
> which we want to block completely.
> 
> 
> Also, I wonder if we can avoid talking about "page reservation" or "altmap reservation",
> that's rather an implementation detail.
> 
> For example, I'd call this patch
> 
> "mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks"
> 
> 

Ok will update

> 
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   mm/memory_hotplug.c | 109 ++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 96 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 5921c81fcb70..c409f5ff6a59 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -41,17 +41,85 @@
>>   #include "internal.h"
>>   #include "shuffle.h"
>>   +enum {
>> +    MEMMAP_ON_MEMORY_DISABLE = 0,
>> +    MEMMAP_ON_MEMORY_ENABLE,
>> +    MEMMAP_ON_MEMORY_FORCE,
>> +};
>> +
>> +static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE;
>> +
>> +static inline unsigned long memory_block_align_base(unsigned long size)
>> +{
> 
> Can we start with something like this instead?
> 
> memory_block_memmap_size() might be reasonable to put into the previous patch.
> 
> 
> static inline unsigned long memory_block_memmap_size(void)
> {
>     return PHYS_PFN(memory_block_size_bytes()) * sizeof(struct page);
> }
> 
> /*
>  * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>  * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>  * aligned, however, we might waste memory.
>  */

I am finding that confusing. We do want things to be pageblock_nr_pages aligned both ways.
With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for the memmap, and
in the default case we do that by making sure only memory blocks of a size that supports
that alignment can use the MEMMAP_ON_MEMORY feature.

> static inline unsigned long memory_block_memmap_on_memory_size(void)
> {
>     unsigned long size = memory_block_memmap_size();
> 
>     if (memmap_mode != MEMMAP_ON_MEMORY_FORCE)
>         return size;
>     return ALIGN(size, PFN_PHYS(pageblock_nr_pages));
> }
>     
> 
> 
>> +    if (memmap_mode == MEMMAP_ON_MEMORY_FORCE) {
>> +        unsigned long align;
>> +        unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>> +        unsigned long vmemmap_size;
>> +
>> +        vmemmap_size = DIV_ROUND_UP(nr_vmemmap_pages * sizeof(struct page), PAGE_SIZE);
>> +        align = pageblock_align(vmemmap_size) - vmemmap_size;
>> +        return align;
>> +    } else
>> +        return 0;
>> +}
>> +
>>   #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
>>   /*
>>    * memory_hotplug.memmap_on_memory parameter
>>    */
>> -static bool memmap_on_memory __ro_after_init;
>> -module_param(memmap_on_memory, bool, 0444);
>> -MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
>> +static int set_memmap_mode(const char *val, const struct kernel_param *kp)
>> +{
>> +    int ret, mode;
>> +    bool enabled;
>> +
>> +    if (sysfs_streq(val, "force") ||  sysfs_streq(val, "FORCE")) {
>> +        mode =  MEMMAP_ON_MEMORY_FORCE;
>> +        goto matched;
>> +    }
>> +
>> +    ret = kstrtobool(val, &enabled);
>> +    if (ret < 0)
>> +        return ret;
>> +    if (enabled)
>> +        mode =  MEMMAP_ON_MEMORY_ENABLE;
>> +    else
>> +        mode =  MEMMAP_ON_MEMORY_DISABLE;
>> +
>> +matched:
>> +    *((int *)kp->arg) =  mode;
>> +    if (mode == MEMMAP_ON_MEMORY_FORCE) {
>> +        pr_info("Memory hotplug will reserve %ld pages in each memory block\n",
>> +            memory_block_align_base(memory_block_size_bytes()));
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int get_memmap_mode(char *buffer, const struct kernel_param *kp)
>> +{
>> +    if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE)
>> +        return sprintf(buffer,  "force\n");
>> +    if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_ENABLE)
>> +        return sprintf(buffer,  "y\n");
>> +
>> +    return sprintf(buffer,  "n\n");
> 
> param_get_bool() uses uppercase Y / N. Maybe just return the uppercase variants here as well.
> 
>> +}
>> +
>> +static const struct kernel_param_ops memmap_mode_ops = {
>> +    .set = set_memmap_mode,
>> +    .get = get_memmap_mode,
>> +};
>> +module_param_cb(memmap_on_memory, &memmap_mode_ops, &memmap_mode, 0444);
>> +MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug\n"
>> +    "With value \"force\" it could result in memory wastage due to memmap size limitations \n"
>> +    "For example, if the memmap for a memory block requires 1 MiB, but the pageblock \n"
>> +    "size is 2 MiB, 1 MiB of hotplugged memory will be wasted. Note that there are \n"
>> +    "still cases where the feature cannot be enforced: for example, if the memmap is \n"
>> +    "smaller than a single page, or if the architecture does not support the forced \n"
>> +    "mode in all configurations. (y/n/force)");
> 
> That's a bit mouthful. Can we simplify and put the full doc into
> 
> Documentation/admin-guide/mm/memory-hotplug.rst
> 
> ?


Will update

> 
>>     static inline bool mhp_memmap_on_memory(void)
>>   {
>> -    return memmap_on_memory;
>> +    return !!memmap_mode;
> 
> Maybe better  "memmap_mode != MEMMAP_ON_MEMORY_DISABLE"
> 

Will update

>>   }
>>   #else
>>   static inline bool mhp_memmap_on_memory(void)
>> @@ -1264,7 +1332,6 @@ static inline bool arch_supports_memmap_on_memory(unsigned long size)
>>     static bool mhp_supports_memmap_on_memory(unsigned long size)
>>   {
>> -
>>       unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>>       unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>>       unsigned long remaining_size = size - vmemmap_size;
>> @@ -1295,10 +1362,23 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>>        *       altmap as an alternative source of memory, and we do not exactly
>>        *       populate a single PMD.
>>        */
>> -    return mhp_memmap_on_memory() &&
>> -           size == memory_block_size_bytes() &&
>> -           IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)) &&
>> -           arch_supports_memmap_on_memory(size);
>> +    if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
>> +        return false;
>> +
>> +    /*
>> +     * Make sure the vmemmap allocation is fully contained
>> +     * so that we always allocate vmemmap memory from altmap area.
>> +     */
>> +    if (!IS_ALIGNED(vmemmap_size,  PAGE_SIZE))
>> +        return false;
>> +     /*
>> +      * Without page reservation remaining pages should be pageblock aligned.
>> +      */
>> +    if (memmap_mode != MEMMAP_ON_MEMORY_FORCE &&
>> +        !IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)))
>> +        return false;
> 
> With our new helper, this becomes the following:
> 
> memmap_on_memory_size = memory_block_memmap_on_memory_size();
> 
> if (!IS_ALIGNED(memmap_on_memory_size, PFN_PHYS(pageblock_nr_pages))
>     /* We're not allowed to waste any memory for the memmap. */
>     return false;
> 
> if (memmap_on_memory_size == memory_block_size_bytes())
>     /* No effective hotplugged memory doesn't make sense. */
>     return false;   
> 

Will update

>> +    return arch_supports_memmap_on_memory(size);
>>   }
>>     /*
>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>   {
>>       struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>       enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>> -    struct vmem_altmap mhp_altmap = {};
>> +    struct vmem_altmap mhp_altmap = {
>> +        .base_pfn =  PHYS_PFN(res->start),
>> +        .end_pfn  =  PHYS_PFN(res->end),
>> +        .reserve  = memory_block_align_base(resource_size(res)),
> 
> Can you remind me why we have to set reserve here at all?
> 
> IOW, can't we simply set
> 
> .free = memory_block_memmap_on_memory_size();
> 
> end then pass
> 
> mhp_altmap.alloc + mhp_altmap.free
> 
> to create_memory_block_devices() instead?
> 

But with the dax usage of altmap, altmap->reserve is what we use to reserve things to get
the required alignment. One difference is where we allocate the struct pages. For this specific
case it should not matter.

static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
{
	return altmap->base_pfn + altmap->reserve + altmap->alloc
		+ altmap->align;
}

And other is where we online a memory block

We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;  

Considering altmap->reserve is what dax pfn_dev uses, is there a reason you want to use altmap->free for this?
I find it confusing to update free when we haven't allocated any altmap blocks yet.
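
In code, that start pfn computation is roughly (sketch, assuming the memory
block carries its vmem_altmap as done later in this series):

	if (mem->altmap)
		nr_vmemmap_pages = mem->altmap->alloc + mem->altmap->reserve;
	/* the usable/online range then starts at start_pfn + nr_vmemmap_pages */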

-aneesh



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
  2023-07-24 15:16       ` Aneesh Kumar K V
@ 2023-07-24 15:41         ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 15:41 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 24.07.23 17:16, Aneesh Kumar K V wrote:

>>
>> /*
>>   * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>>   * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>>   * aligned, however, we might waste memory.
>>   */
> 
> I am finding that confusing. We do want things to be pageblock_nr_pages aligned both ways.
> With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for memmap and
> in the default case we do that by making sure only memory blocks of specific size supporting
> that alignment can use MEMMAP_ON_MEMORY feature.

See the usage in mhp_supports_memmap_on_memory(), I guess that makes
sense then.

But if you have any ideas on how to clarify that (terminology), I'm all 
ears!

[...]

>>> +    return arch_supports_memmap_on_memory(size);
>>>    }
>>>      /*
>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>>    {
>>>        struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>        enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>> -    struct vmem_altmap mhp_altmap = {};
>>> +    struct vmem_altmap mhp_altmap = {
>>> +        .base_pfn =  PHYS_PFN(res->start),
>>> +        .end_pfn  =  PHYS_PFN(res->end),
>>> +        .reserve  = memory_block_align_base(resource_size(res)),
>>
>> Can you remind me why we have to set reserve here at all?
>>
>> IOW, can't we simply set
>>
>> .free = memory_block_memmap_on_memory_size();
>>
>> end then pass
>>
>> mhp_altmap.alloc + mhp_altmap.free
>>
>> to create_memory_block_devices() instead?
>>
> 
> But with the dax usage of altmap, altmap->reserve is what we use to reserve things to get
> the required alignment. One difference is where we allocate the struct page at. For this specific
> case it should not matter.
> 
> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
> {
> 	return altmap->base_pfn + altmap->reserve + altmap->alloc
> 		+ altmap->align;
> }
> 
> And other is where we online a memory block
> 
> We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;
> 
> Considering altmap->reserve is what dax pfn_dev use, is there a reason you want to use altmap->free for this?

"Reserve" is all about "reserving that much memory for driver usage".

We don't care about that. We simply want vmemmap allocations coming from 
the pageblock(s) we set aside. Where exactly, we don't care.

> I find it confusing to update free when we haven't allocated any altmap blocks yet.

"
@reserve: pages mapped, but reserved for driver use (relative to @base)
@free: free pages set aside in the mapping for memmap storage
@alloc: track pages consumed, private to vmemmap_populate()
"

To me, that implies that we can ignore "reserve". We set @free to the 
aligned value and let the vmemmap get allocated from anything in there.

free + alloc should always sum up to our set-aside pageblock(s), no?


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/6] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
  2023-07-24 14:46       ` Aneesh Kumar K V
@ 2023-07-24 15:52         ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 15:52 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

>>> +            mhp_altmap.base_pfn = PHYS_PFN(start);
>>> +            mhp_altmap.free = PHYS_PFN(size) - nr_vmemmap_pages;
>>
>>
>> That change does not belong into this patch.
>>
> 
> 
> I kept that change with ppc64 enablement because only ppc64 arch got check against
> those values in the free path.

Let's make that accounting consistent in patch #4. I think it really 
belongs in there, especially once we clarify the "free vs. reserved" 
handling.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
  2023-07-24 15:41         ` David Hildenbrand
@ 2023-07-24 16:02           ` Aneesh Kumar K V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K V @ 2023-07-24 16:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Vishal Verma, Michal Hocko, Oscar Salvador

On 7/24/23 9:11 PM, David Hildenbrand wrote:
> On 24.07.23 17:16, Aneesh Kumar K V wrote:
> 
>>>
>>> /*
>>>   * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>>>   * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>>>   * aligned, however, we might waste memory.
>>>   */
>>
>> I am finding that confusing. We do want things to be pageblock_nr_pages aligned both ways.
>> With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for memmap and
>> in the default case we do that by making sure only memory blocks of specific size supporting
>> that alignment can use MEMMAP_ON_MEMORY feature.
> 
> See the usage in mhp_supports_memmap_on_memory(), I guess that makes sense then.
> 
> But if you have any ideas on how to clarify that (terminology), I'm all ears!
> 


I updated the commit message 

mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

Currently, memmap_on_memory feature is only supported with memory block
sizes that result in vmemmap pages covering full page blocks. This is
because memory onlining/offlining code requires applicable ranges to be
pageblock-aligned, for example, to set the migratetypes properly.

This patch helps to lift that restriction by reserving more pages than
required for vmemmap space. This helps to align the start addr to be
page block aligned with different memory block sizes. This implies the
kernel will be reserving some pages for every memoryblock. This also
allows the memmap on memory feature to be widely useful with different
memory block size values.

For example: with 64K page size and 256MiB memory block size, we require 4
pages to map vmemmap pages. To align things correctly we end up adding a
reserve of 28 pages, i.e., for every 4096 pages, 28 pages get reserved.
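
(Worked numbers for that example, assuming sizeof(struct page) == 64 and a
2MiB pageblock:

	pages per 256MiB block:  256MiB / 64K       = 4096
	memmap size:             4096 * 64 bytes    = 256KiB = 4 x 64K pages
	pageblock:               2MiB               = 32 x 64K pages
	aligned memmap area:     ALIGN(4, 32)       = 32 pages
	reserve for alignment:   32 - 4             = 28 pages)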


Also, while implementing your suggestion to use memory_block_memmap_on_memory_size(),
I am finding it not really useful, because in mhp_supports_memmap_on_memory() we are checking
whether remaining_size is pageblock_nr_pages aligned (dax_kmem may want to use that helper
later). Also, I still think altmap.reserve is easier because of the start_pfn calculation
(more on this below).



> [...]
> 
>>>> +    return arch_supports_memmap_on_memory(size);
>>>>    }
>>>>      /*
>>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>>>    {
>>>>        struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>>        enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>>> -    struct vmem_altmap mhp_altmap = {};
>>>> +    struct vmem_altmap mhp_altmap = {
>>>> +        .base_pfn =  PHYS_PFN(res->start),
>>>> +        .end_pfn  =  PHYS_PFN(res->end),
>>>> +        .reserve  = memory_block_align_base(resource_size(res)),
>>>
>>> Can you remind me why we have to set reserve here at all?
>>>
>>> IOW, can't we simply set
>>>
>>> .free = memory_block_memmap_on_memory_size();
>>>
>>> end then pass
>>>
>>> mhp_altmap.alloc + mhp_altmap.free
>>>
>>> to create_memory_block_devices() instead?
>>>
>>
>> But with the dax usage of altmap, altmap->reserve is what we use to reserve things to get
>> the required alignment. One difference is where we allocate the struct page at. For this specific
>> case it should not matter.
>>
>> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
>> {
>>     return altmap->base_pfn + altmap->reserve + altmap->alloc
>>         + altmap->align;
>> }
>>
>> And other is where we online a memory block
>>
>> We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;
>>
>> Considering altmap->reserve is what dax pfn_dev use, is there a reason you want to use altmap->free for this?
> 
> "Reserve" is all about "reserving that much memory for driver usage".
> 
> We don't care about that. We simply want vmemmap allocations coming from the pageblock(s) we set aside. Where exactly, we don't care.
> 
>> I find it confusing to update free when we haven't allocated any altmap blocks yet.
> 
> "
> @reserve: pages mapped, but reserved for driver use (relative to @base)"
> @free: free pages set aside in the mapping for memmap storage
> @alloc: track pages consumed, private to vmemmap_populate()
> "
> 
> To me, that implies that we can ignore "reserve". We set @free to the aligned value and let the vmemmap get allocated from anything in there.
> 
> free + alloc should always sum up to our set-aside pageblock(s), no?
> 
>

The difference is 

 mhp_altmap.free = PHYS_PFN(size) - reserved blocks;

i.e., with 256MiB memory block size with 64K pages, we need 4 memmap pages and we reserve 28 pages for alignment.

mhp_altmap.free = PHYS_PFN(size) - 28. 

So the 4 pages from which we are allocating the memmap pages are still counted as free pages.

We could all make it work by doing

mhp_altmap.free = PHYS_PFN(size) -  (memory_block_memmap_on_memory_size() - memory_block_memmap_size())

But is that any better than what we have now? I understand the term "reserved for driver use" is confusing for this use case.
But it is really reserving things for required alignment. 

-aneesh




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
  2023-07-24 16:02           ` Aneesh Kumar K V
@ 2023-07-24 16:24             ` David Hildenbrand
  -1 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-07-24 16:24 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

On 24.07.23 18:02, Aneesh Kumar K V wrote:
> On 7/24/23 9:11 PM, David Hildenbrand wrote:
>> On 24.07.23 17:16, Aneesh Kumar K V wrote:
>>
>>>>
>>>> /*
>>>>    * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>>>>    * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>>>>    * aligned, however, we might waste memory.
>>>>    */
>>>
>>> I am finding that confusing. We do want things to be pageblock_nr_pages aligned both ways.
>>> With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for memmap and
>>> in the default case we do that by making sure only memory blocks of specific size supporting
>>> that alignment can use MEMMAP_ON_MEMORY feature.
>>
>> See the usage in mhp_supports_memmap_on_memory(), I guess that makes sense then.
>>
>> But if you have any ideas on how to clarify that (terminology), I'm all ears!
>>
> 
> 
> I updated the commit message
> 
> mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks
> 
> Currently, memmap_on_memory feature is only supported with memory block
> sizes that result in vmemmap pages covering full page blocks. This is
> because memory onlining/offlining code requires applicable ranges to be
> pageblock-aligned, for example, to set the migratetypes properly.
> 
> This patch helps to lift that restriction by reserving more pages than
> required for vmemmap space. This helps to align the start addr to be
> page block aligned with different memory block sizes. This implies the
> kernel will be reserving some pages for every memoryblock. This also
> allows the memmap on memory feature to be widely useful with different
> memory block size values.
> 
> For ex: with 64K page size and 256MiB memory block size, we require 4
> pages to map vmemmap pages, To align things correctly we end up adding a
> reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved.
> 
> 

Much better.

> Also while implementing your  suggestion to use memory_block_memmap_on_memory_size()
> I am finding it not really useful because in mhp_supports_memmap_on_memory() we are checking
> if remaining_size is pageblock_nr_pages aligned (dax_kmem may want to use that helper
> later).

Let's focus on this patchset here first.

Factoring out how many memmap pages we actually need vs. how many pages
we need when aligning up sounds very reasonable to me.


Can you elaborate what the problem is?

> Also I still think altmap.reserve is easier because of the start_pfn calculation.
> (more on this below)

Can you elaborate? Do you mean the try_remove_memory() change?

> 
> 
>> [...]
>>
>>>>> +    return arch_supports_memmap_on_memory(size);
>>>>>     }
>>>>>       /*
>>>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>>>>     {
>>>>>         struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>>>         enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>>>> -    struct vmem_altmap mhp_altmap = {};
>>>>> +    struct vmem_altmap mhp_altmap = {
>>>>> +        .base_pfn =  PHYS_PFN(res->start),
>>>>> +        .end_pfn  =  PHYS_PFN(res->end),
>>>>> +        .reserve  = memory_block_align_base(resource_size(res)),
>>>>
>>>> Can you remind me why we have to set reserve here at all?
>>>>
>>>> IOW, can't we simply set
>>>>
>>>> .free = memory_block_memmap_on_memory_size();
>>>>
>>>> end then pass
>>>>
>>>> mhp_altmap.alloc + mhp_altmap.free
>>>>
>>>> to create_memory_block_devices() instead?
>>>>
>>>
>>> But with the dax usage of altmap, altmap->reserve is what we use to reserve things to get
>>> the required alignment. One difference is where we allocate the struct page at. For this specific
>>> case it should not matter.
>>>
>>> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
>>> {
>>>      return altmap->base_pfn + altmap->reserve + altmap->alloc
>>>          + altmap->align;
>>> }
>>>
>>> And other is where we online a memory block
>>>
>>> We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;
>>>
>>> Considering altmap->reserve is what dax pfn_dev use, is there a reason you want to use altmap->free for this?
>>
>> "Reserve" is all about "reserving that much memory for driver usage".
>>
>> We don't care about that. We simply want vmemmap allocations coming from the pageblock(s) we set aside. Where exactly, we don't care.
>>
>>> I find it confusing to update free when we haven't allocated any altmap blocks yet.
>>
>> "
>> @reserve: pages mapped, but reserved for driver use (relative to @base)"
>> @free: free pages set aside in the mapping for memmap storage
>> @alloc: track pages consumed, private to vmemmap_populate()
>> "
>>
>> To me, that implies that we can ignore "reserve". We set @free to the aligned value and let the vmemmap get allocated from anything in there.
>>
>> free + alloc should always sum up to our set-aside pageblock(s), no?
>>
>>
> 
> The difference is
> 
>   mhp_altmap.free = PHYS_PFN(size) - reserved blocks;
> 
> ie, with 256MiB memory block size with 64K pages, we need 4 memmap pages and we reserve 28 pages for aligment.
> 
> mhp_altmap.free = PHYS_PFN(size) - 28.
> 
> So that 4 pages from which we are allocating the memmap pages are still counted in free page.
> 
> We could all make it work by doing
> 
> mhp_altmap.free = PHYS_PFN(size) -  (memory_block_memmap_on_memory_size() - memory_block_memmap_size())
> 
> But is that any better than what we have now? I understand the term "reserved for driver use" is confusing for this use case.
> But it is really reserving things for required alignment.


Let's take a step back.

altmap->alloc tells us how much was already allocated.

altmap->free tells us how much memory we can allocate at max (confusing, 
but see vmem_altmap_nr_free()).

altmap->free should actually have been called differently.
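
For reference, vmem_altmap_nr_free() is roughly this (mm/sparse-vmemmap.c):

static unsigned long __meminit vmem_altmap_nr_free(struct vmem_altmap *altmap)
{
	unsigned long allocated = altmap->alloc + altmap->align;

	if (altmap->free > allocated)
		return altmap->free - allocated;
	return 0;
}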


I think it's currently even *wrong* to set free = PHYS_PFN(size). We 
don't want to allocate beyond the first pageblock(s) we selected.


Can't we set:

1) add_memory_resource():

	.base_pfn = PHYS_PFN(start);
	.free = PHYS_PFN(memory_block_memmap_on_memory_size());

2) try_remove_memory():
	.base_pfn = PHYS_PFN(start);
	.alloc = PHYS_PFN(memory_block_memmap_on_memory_size());

Faking that all was allocated and avoiding any reservation terminology?
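
A sketch of what I mean (not a final patch, and reusing the
memory_block_memmap_on_memory_size() helper from above):

/* add_memory_resource(): the whole aligned memmap area is allocatable */
struct vmem_altmap mhp_altmap = {
	.base_pfn = PHYS_PFN(start),
	.free     = PHYS_PFN(memory_block_memmap_on_memory_size()),
};
...
/* and hand the complete set-aside range to the memory block devices */
ret = create_memory_block_devices(start, size,
				  mhp_altmap.alloc + mhp_altmap.free, group);

/* try_remove_memory(): fake that the whole set-aside range was allocated */
struct vmem_altmap mhp_altmap = {
	.base_pfn = PHYS_PFN(start),
	.alloc    = PHYS_PFN(memory_block_memmap_on_memory_size()),
};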

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation
  2023-07-24 16:24             ` David Hildenbrand
@ 2023-07-24 17:29               ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-24 17:29 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, akpm, mpe, linuxppc-dev, npiggin,
	christophe.leroy
  Cc: Oscar Salvador, Michal Hocko, Vishal Verma

David Hildenbrand <david@redhat.com> writes:

> On 24.07.23 18:02, Aneesh Kumar K V wrote:
>> On 7/24/23 9:11 PM, David Hildenbrand wrote:
>>> On 24.07.23 17:16, Aneesh Kumar K V wrote:
>>>
>>>>>
>>>>> /*
>>>>>    * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>>>>>    * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>>>>>    * aligned, however, we might waste memory.
>>>>>    */
>>>>
>>>> I am finding that confusing. We do want things to be pageblock_nr_pages aligned both ways.
>>>> With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for memmap and
>>>> in the default case we do that by making sure only memory blocks of specific size supporting
>>>> that alignment can use MEMMAP_ON_MEMORY feature.
>>>
>>> See the usage in mhp_supports_memmap_on_memory(), I guess that makes sense then.
>>>
>>> But if you have any ideas on how to clarify that (terminology), I'm all ears!
>>>
>> 
>> 
>> I updated the commit message
>> 
>> mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks
>> 
>> Currently, memmap_on_memory feature is only supported with memory block
>> sizes that result in vmemmap pages covering full page blocks. This is
>> because memory onlining/offlining code requires applicable ranges to be
>> pageblock-aligned, for example, to set the migratetypes properly.
>> 
>> This patch helps to lift that restriction by reserving more pages than
>> required for vmemmap space. This helps to align the start addr to be
>> page block aligned with different memory block sizes. This implies the
>> kernel will be reserving some pages for every memoryblock. This also
>> allows the memmap on memory feature to be widely useful with different
>> memory block size values.
>> 
>> For ex: with 64K page size and 256MiB memory block size, we require 4
>> pages to map vmemmap pages, To align things correctly we end up adding a
>> reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved.
>> 
>> 
>
> Much better.
>
>> Also while implementing your  suggestion to use memory_block_memmap_on_memory_size()
>> I am finding it not really useful because in mhp_supports_memmap_on_memory() we are checking
>> if remaining_size is pageblock_nr_pages aligned (dax_kmem may want to use that helper
>> later).
>
> Let's focus on this patchset here first.
>
> Factoring out how manye memmap pages we actually need vs. how many pages 
> we need when aligning up sound very reasonable to me.
>
>
> Can you elaborate what the problem is?
>
>> Also I still think altmap.reserve is easier because of the start_pfn calculation.
>> (more on this below)
>
> Can you elaborate? Do you mean the try_remove_memory() change?
>
>> 
>> 
>>> [...]
>>>
>>>>>> +    return arch_supports_memmap_on_memory(size);
>>>>>>     }
>>>>>>       /*
>>>>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>>>>>     {
>>>>>>         struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>>>>         enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>>>>> -    struct vmem_altmap mhp_altmap = {};
>>>>>> +    struct vmem_altmap mhp_altmap = {
>>>>>> +        .base_pfn =  PHYS_PFN(res->start),
>>>>>> +        .end_pfn  =  PHYS_PFN(res->end),
>>>>>> +        .reserve  = memory_block_align_base(resource_size(res)),
>>>>>
>>>>> Can you remind me why we have to set reserve here at all?
>>>>>
>>>>> IOW, can't we simply set
>>>>>
>>>>> .free = memory_block_memmap_on_memory_size();
>>>>>
>>>>> end then pass
>>>>>
>>>>> mhp_altmap.alloc + mhp_altmap.free
>>>>>
>>>>> to create_memory_block_devices() instead?
>>>>>
>>>>
>>>> But with the dax usage of altmap, altmap->reserve is what we use to reserve things to get
>>>> the required alignment. One difference is where we allocate the struct page at. For this specific
>>>> case it should not matter.
>>>>
>>>> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
>>>> {
>>>>      return altmap->base_pfn + altmap->reserve + altmap->alloc
>>>>          + altmap->align;
>>>> }
>>>>
>>>> And other is where we online a memory block
>>>>
>>>> We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;
>>>>
>>>> Considering altmap->reserve is what dax pfn_dev use, is there a reason you want to use altmap->free for this?
>>>
>>> "Reserve" is all about "reserving that much memory for driver usage".
>>>
>>> We don't care about that. We simply want vmemmap allocations coming from the pageblock(s) we set aside. Where exactly, we don't care.
>>>
>>>> I find it confusing to update free when we haven't allocated any altmap blocks yet.
>>>
>>> "
>>> @reserve: pages mapped, but reserved for driver use (relative to @base)
>>> @free: free pages set aside in the mapping for memmap storage
>>> @alloc: track pages consumed, private to vmemmap_populate()
>>> "
>>>
>>> To me, that implies that we can ignore "reserve". We set @free to the aligned value and let the vmemmap get allocated from anything in there.
>>>
>>> free + alloc should always sum up to our set-aside pageblock(s), no?
>>>
>>>
>> 
>> The difference is
>> 
>>   mhp_altmap.free = PHYS_PFN(size) - reserved blocks;
>> 
>> i.e. with a 256MiB memory block size and 64K pages, we need 4 memmap pages and we reserve 28 pages for alignment.
>> 
>> mhp_altmap.free = PHYS_PFN(size) - 28.
>> 
>> So the 4 pages from which we are allocating the memmap pages are still counted as free pages.
>> 
>> We could also make it work by doing
>> 
>> mhp_altmap.free = PHYS_PFN(size) - (memory_block_memmap_on_memory_size() - memory_block_memmap_size())
>> 
>> But is that any better than what we have now? I understand the term "reserved for driver use" is confusing for this use case,
>> but it is really reserving things for the required alignment.
>
>
> Let's take a step back.
>
> altmap->alloc tells us how much was already allocated.
>
> altmap->free tells us how much memory we can allocate at max (confusing, 
> but see vmem_altmap_nr_free()).
>
> altmap->free should actually have been called differently.
>
>
> I think it's currently even *wrong* to set free = PHYS_PFN(size). We 
> don't want to allocate beyond the first pageblock(s) we selected.
>

You are correct. The calculation of altmap.free was wrong: it is wrong
in the current upstream code, and the ppc64 code carries a matching
wrong computation.

modified   arch/powerpc/mm/init_64.c
@@ -326,8 +326,7 @@ void __ref __vmemmap_free(unsigned long start, unsigned long end,
 	start = ALIGN_DOWN(start, page_size);
 	if (altmap) {
 		alt_start = altmap->base_pfn;
-		alt_end = altmap->base_pfn + altmap->reserve +
-			  altmap->free + altmap->alloc + altmap->align;
+		alt_end = altmap->base_pfn + altmap->reserve + altmap->free;
 	}
 
 	pr_debug("vmemmap_free %lx...%lx\n", start, end);
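
Just to sanity-check the numbers from the example upthread, here is a small
userspace sketch (not kernel code; it assumes a 64-byte struct page and
32-page, i.e. 2MiB, pageblocks, which is what the 4-page memmap / 28-page
reserve split corresponds to for a 256MiB memory block with 64K pages). It
mirrors what the memory_block_memmap_on_memory_size() helper in the updated
patch below computes:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 64UL * 1024;		/* 64K page size */
	unsigned long block_size = 256UL << 20;		/* 256MiB memory block */
	unsigned long struct_page_size = 64;		/* assumed sizeof(struct page) */
	unsigned long pageblock_nr_pages = 32;		/* assumed 2MiB pageblocks */

	unsigned long nr_pages = block_size / page_size;			/* 4096 */
	unsigned long memmap_pages = nr_pages * struct_page_size / page_size;	/* 4 */
	/* round the memmap up to a full pageblock, as the "force" mode does */
	unsigned long set_aside = (memmap_pages + pageblock_nr_pages - 1) /
				  pageblock_nr_pages * pageblock_nr_pages;	/* 32 */

	printf("memmap pages: %lu, extra pages reserved for alignment: %lu\n",
	       memmap_pages, set_aside - memmap_pages);		/* 4 and 28 */
	printf("altmap.free should be %lu, not PHYS_PFN(size) = %lu\n",
	       set_aside, nr_pages);
	return 0;
}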


Fixing all that up the patch is now updated as below

1 file changed, 109 insertions(+), 15 deletions(-)
mm/memory_hotplug.c | 124 +++++++++++++++++++++++++++++++++++++++++++++-------

modified   mm/memory_hotplug.c
@@ -41,17 +41,91 @@
 #include "internal.h"
 #include "shuffle.h"
 
+enum {
+	MEMMAP_ON_MEMORY_DISABLE = 0,
+	MEMMAP_ON_MEMORY_ENABLE,
+	MEMMAP_ON_MEMORY_FORCE,
+};
+
+static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE;
+
+static inline unsigned long memory_block_memmap_size(void)
+{
+	return PHYS_PFN(memory_block_size_bytes()) * sizeof(struct page);
+}
+
+static inline unsigned long memory_block_memmap_on_memory_size(void)
+{
+	unsigned long size = memory_block_memmap_size();
+
+	/*
+	 * In "forced" memmap_on_memory mode, we add extra pages to align the
+	 * vmemmap size up to cover full pageblocks. That way, we can add
+	 * memory even if the vmemmap size is not properly aligned; however,
+	 * we might waste memory.
+	 */
+	if (memmap_mode == MEMMAP_ON_MEMORY_FORCE)
+		return ALIGN(size, PFN_PHYS(pageblock_nr_pages));
+	return size;
+}
+
 #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
 /*
  * memory_hotplug.memmap_on_memory parameter
  */
-static bool memmap_on_memory __ro_after_init;
-module_param(memmap_on_memory, bool, 0444);
-MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
+static int set_memmap_mode(const char *val, const struct kernel_param *kp)
+{
+	int ret, mode;
+	bool enabled;
+
+	if (sysfs_streq(val, "force") || sysfs_streq(val, "FORCE")) {
+		mode = MEMMAP_ON_MEMORY_FORCE;
+		goto matched;
+	}
+
+	ret = kstrtobool(val, &enabled);
+	if (ret < 0)
+		return ret;
+	if (enabled)
+		mode = MEMMAP_ON_MEMORY_ENABLE;
+	else
+		mode = MEMMAP_ON_MEMORY_DISABLE;
+
+matched:
+	*((int *)kp->arg) = mode;
+	if (mode == MEMMAP_ON_MEMORY_FORCE) {
+		pr_info("Memory hotplug will reserve %ld pages in each memory block\n",
+			memory_block_memmap_on_memory_size() - memory_block_memmap_size());
+	}
+	return 0;
+}
+
+static int get_memmap_mode(char *buffer, const struct kernel_param *kp)
+{
+	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE)
+		return sprintf(buffer, "force\n");
+	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_ENABLE)
+		return sprintf(buffer, "y\n");
+
+	return sprintf(buffer, "n\n");
+}
+
+static const struct kernel_param_ops memmap_mode_ops = {
+	.set = set_memmap_mode,
+	.get = get_memmap_mode,
+};
+module_param_cb(memmap_on_memory, &memmap_mode_ops, &memmap_mode, 0444);
+MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug\n"
+	"With value \"force\" it could result in memory wastage due to memmap size limitations\n"
+	"For example, if the memmap for a memory block requires 1 MiB, but the pageblock\n"
+	"size is 2 MiB, 1 MiB of hotplugged memory will be wasted. Note that there are\n"
+	"still cases where the feature cannot be enforced: for example, if the memmap is\n"
+	"smaller than a single page, or if the architecture does not support the forced\n"
+	"mode in all configurations. (y/n/force)");
 
 static inline bool mhp_memmap_on_memory(void)
 {
-	return memmap_on_memory;
+	return memmap_mode != MEMMAP_ON_MEMORY_DISABLE;
 }
 #else
 static inline bool mhp_memmap_on_memory(void)
@@ -1266,7 +1340,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 {
 	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
 	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
-	unsigned long remaining_size = size - vmemmap_size;
+	unsigned long memmap_on_memory_size = memory_block_memmap_on_memory_size();
 
 	/*
 	 * Besides having arch support and the feature enabled at runtime, we
@@ -1294,10 +1368,28 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 	 *       altmap as an alternative source of memory, and we do not exactly
 	 *       populate a single PMD.
 	 */
-	return mhp_memmap_on_memory() &&
-	       size == memory_block_size_bytes() &&
-	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)) &&
-	       arch_supports_memmap_on_memory(size);
+	if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
+		return false;
+
+	/*
+	 * Make sure the vmemmap allocation is fully contained
+	 * so that we always allocate vmemmap memory from the altmap area.
+	 */
+	if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE))
+		return false;
+
+	/*
+	 * start pfn should be pageblock_nr_pages aligned for correctly
+	 * setting migrate types
+	 */
+	if (!IS_ALIGNED(memmap_on_memory_size, PFN_PHYS(pageblock_nr_pages)))
+		return false;
+
+	if (memmap_on_memory_size == memory_block_size_bytes())
+		/* No effective hotplugged memory doesn't make sense. */
+		return false;
+
+	return arch_supports_memmap_on_memory(size);
 }
 
 /*
@@ -1310,7 +1402,10 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap mhp_altmap = {
+		.base_pfn =  PHYS_PFN(res->start),
+		.end_pfn  =  PHYS_PFN(res->end),
+	};
 	struct memory_group *group = NULL;
 	u64 start, size;
 	bool new_node = false;
@@ -1355,8 +1450,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	 */
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
 		if (mhp_supports_memmap_on_memory(size)) {
-			mhp_altmap.free = PHYS_PFN(size);
-			mhp_altmap.base_pfn = PHYS_PFN(start);
+			mhp_altmap.free = PHYS_PFN(memory_block_memmap_on_memory_size());
 			params.altmap = &mhp_altmap;
 		}
 		/* fallback to not using altmap  */
@@ -1368,8 +1462,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
-					  group);
+	ret = create_memory_block_devices(start, size, mhp_altmap.free, group);
 	if (ret) {
 		arch_remove_memory(start, size, NULL);
 		goto error;
@@ -2090,7 +2183,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
 			 * right thing if we used vmem_altmap when hot-adding
 			 * the range.
 			 */
-			mhp_altmap.alloc = nr_vmemmap_pages;
+			mhp_altmap.base_pfn = PHYS_PFN(start);
+			mhp_altmap.free = nr_vmemmap_pages;
 			altmap = &mhp_altmap;
 		}
 	}
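
For completeness, the expected interface with the above (a sketch based on
the patch, not tested output): booting with

	memory_hotplug.memmap_on_memory=force

on the kernel command line selects the forced mode, while the usual boolean
spellings keep the plain enable/disable behaviour. Because the parameter is
registered with mode 0444, it can only be read back at runtime:

	# cat /sys/module/memory_hotplug/parameters/memmap_on_memory
	force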




^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2023-07-24 17:45 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-18  2:44 [PATCH v4 0/6] Add support for memmap on memory feature on ppc64 Aneesh Kumar K.V
2023-07-18  2:44 ` Aneesh Kumar K.V
2023-07-18  2:44 ` [PATCH v4 1/6] mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig Aneesh Kumar K.V
2023-07-18  2:44   ` Aneesh Kumar K.V
2023-07-18  2:44 ` [PATCH v4 2/6] mm/hotplug: Allow memmap on memory hotplug request to fallback Aneesh Kumar K.V
2023-07-18  2:44   ` Aneesh Kumar K.V
2023-07-24 12:29   ` David Hildenbrand
2023-07-24 12:29     ` David Hildenbrand
2023-07-18  2:44 ` [PATCH v4 3/6] mm/hotplug: Allow architecture to override memmap on memory support check Aneesh Kumar K.V
2023-07-18  2:44   ` Aneesh Kumar K.V
2023-07-24 12:30   ` David Hildenbrand
2023-07-24 12:30     ` David Hildenbrand
2023-07-24 13:47   ` David Hildenbrand
2023-07-24 13:47     ` David Hildenbrand
2023-07-18  2:44 ` [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation Aneesh Kumar K.V
2023-07-18  2:44   ` Aneesh Kumar K.V
2023-07-24 14:33   ` David Hildenbrand
2023-07-24 14:33     ` David Hildenbrand
2023-07-24 15:16     ` Aneesh Kumar K V
2023-07-24 15:16       ` Aneesh Kumar K V
2023-07-24 15:41       ` David Hildenbrand
2023-07-24 15:41         ` David Hildenbrand
2023-07-24 16:02         ` Aneesh Kumar K V
2023-07-24 16:02           ` Aneesh Kumar K V
2023-07-24 16:24           ` David Hildenbrand
2023-07-24 16:24             ` David Hildenbrand
2023-07-24 17:29             ` Aneesh Kumar K.V
2023-07-24 17:29               ` Aneesh Kumar K.V
2023-07-18  2:44 ` [PATCH v4 5/6] powerpc/book3s64/memhotplug: Enable memmap on memory for radix Aneesh Kumar K.V
2023-07-18  2:44   ` Aneesh Kumar K.V
2023-07-24 14:34   ` David Hildenbrand
2023-07-24 14:34     ` David Hildenbrand
2023-07-24 14:46     ` Aneesh Kumar K V
2023-07-24 14:46       ` Aneesh Kumar K V
2023-07-24 15:52       ` David Hildenbrand
2023-07-24 15:52         ` David Hildenbrand
2023-07-18  2:44 ` [PATCH v4 6/6] mm/hotplug: Embed vmem_altmap details in memory block Aneesh Kumar K.V
2023-07-18  2:44   ` Aneesh Kumar K.V

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.