* [PATCH 0/8] implement "memmap on memory" feature on s390
@ 2023-11-14 18:02 Sumanth Korikkar
  2023-11-14 18:02 ` [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order Sumanth Korikkar
                   ` (8 more replies)
  0 siblings, 9 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

Hi All,

This patch series implements the "memmap on memory" feature on s390 and
provides the necessary fixes for it.

Patch 1 addresses the locking order in memory hotplug operations,
ensuring that the mem_hotplug_lock is held during critical operations
like mhp_init_memmap_on_memory() and mhp_deinit_memmap_on_memory().

Patch 2 fixes the error handling in add_memory_resource() to consider
the possibility of altmap support. This ensures that struct pages are
deallocated the same way they were allocated.

Patch 3 relocates the vmem_altmap code to sparse-vmemmap.c, enabling the
utilization of vmem_altmap_free() and vmem_altmap_offset() without the
dependency on CONFIG_ZONE_DEVICE. Note: these functions are also used
by the arm64 architecture; however, neither ZONE_DEVICE nor
ARCH_HAS_ZONE_DEVICE seems to be enabled on arm64.

Patch 4 introduces the MEM_PHYS_ONLINE/OFFLINE memory notifiers. They
facilitate the emulation of the dynamic ACPI event-triggered logic for
memory hotplug on platforms lacking such events. This sets the stage for
implementing the "memmap on memory" feature for s390 in subsequent
patches. All architectures/codepaths handle the default case in their
memory notifiers, so introducing the new notifier types has no
functional impact.
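
(Illustrative sketch only, not part of the series: with patch 4 applied,
a platform notifier consuming the new events could look roughly like
this. The callback and notifier_block names are hypothetical.)

  #include <linux/memory.h>
  #include <linux/notifier.h>
  #include <linux/printk.h>

  static int arch_mem_notifier(struct notifier_block *nb,
                               unsigned long action, void *data)
  {
          struct memory_notify *arg = data;

          switch (action) {
          case MEM_PHYS_ONLINE:
                  /* make [start_pfn, start_pfn + nr_pages) physically accessible */
                  pr_debug("phys online: pfn=%lx pages=%lx\n",
                           arg->start_pfn, arg->nr_pages);
                  break;
          case MEM_PHYS_OFFLINE:
                  /* make the range physically inaccessible again */
                  pr_debug("phys offline: pfn=%lx pages=%lx\n",
                           arg->start_pfn, arg->nr_pages);
                  break;
          default:
                  /* unknown notifier types are ignored */
                  break;
          }
          return NOTIFY_OK;
  }

  static struct notifier_block arch_mem_nb = {
          .notifier_call = arch_mem_notifier,
  };
  /* registered via register_memory_notifier(&arch_mem_nb) */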

Patch 5 allocates vmemmap pages from the self-contained memory range on
s390. It allocates the memory map (struct pages array) from the
hotplugged memory range, rather than from system memory, by passing the
altmap to the vmemmap functions.

Patch 6 implements the MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory
notifiers on s390. It involves making the memory block physically
accessible and then calling __add_pages()/__remove_pages() with the
altmap parameter.

Patch 7 removes unhandled memory notifier types, which are covered by
the default case.

Patch 8 finally enables MHP_MEMMAP_ON_MEMORY on s390.

Thank you

Sumanth Korikkar (8):
  mm/memory_hotplug: fix memory hotplug locking order
  mm/memory_hotplug: fix error handling in add_memory_resource()
  mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
  mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  s390/mm: allocate vmemmap pages from self-contained memory range
  s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  s390/sclp: remove unhandled memory notifier type
  s390: enable MHP_MEMMAP_ON_MEMORY

 arch/s390/Kconfig            |  1 +
 arch/s390/mm/init.c          | 19 ++++++++---
 arch/s390/mm/vmem.c          | 62 ++++++++++++++++++++----------------
 drivers/base/memory.c        | 28 ++++++++++++++--
 drivers/s390/char/sclp_cmd.c | 37 +++++++++++++++------
 include/linux/memory.h       |  2 ++
 include/linux/memremap.h     | 12 -------
 include/linux/mm.h           |  2 ++
 mm/memory_hotplug.c          | 15 ++++-----
 mm/memremap.c                | 14 +-------
 mm/sparse-vmemmap.c          | 13 ++++++++
 11 files changed, 129 insertions(+), 76 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-14 18:22   ` David Hildenbrand
  2023-11-14 18:02 ` [PATCH 2/8] mm/memory_hotplug: fix error handling in add_memory_resource() Sumanth Korikkar
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

From Documentation/core-api/memory-hotplug.rst:
When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock
in write mode to serialise memory hotplug (e.g. access to global/zone
variables).

mhp_(de)init_memmap_on_memory() functions can change zone stats and
struct page content, but they are currently called w/o the
mem_hotplug_lock.

When a memory block is being offlined and kmemleak goes through each
populated zone, the following theoretical race condition could occur:
CPU 0:					     | CPU 1:
memory_offline()			     |
-> offline_pages()			     |
	-> mem_hotplug_begin()		     |
	   ...				     |
	-> mem_hotplug_done()		     |
					     | kmemleak_scan()
					     | -> get_online_mems()
					     |    ...
-> mhp_deinit_memmap_on_memory()	     |
  [not protected by mem_hotplug_begin/done()]|
  Marks memory section as offline,	     |   Retrieves zone_start_pfn
  poisons vmemmap struct pages and updates   |   and struct page members.
  the zone related data			     |
   					     |    ...
   					     | -> put_online_mems()

Fix this by ensuring mem_hotplug_lock is taken before performing
mhp_init_memmap_on_memory(). Also ensure that
mhp_deinit_memmap_on_memory() holds the lock.

online/offline_pages() are currently only called from
memory_block_online/offline(), so it is safe to move the locking there.

Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 drivers/base/memory.c | 12 +++++++++---
 mm/memory_hotplug.c   | 13 ++++++-------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f3b9a4d0fa3b..1e9f6a1749b9 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -204,10 +204,11 @@ static int memory_block_online(struct memory_block *mem)
 	if (mem->altmap)
 		nr_vmemmap_pages = mem->altmap->free;
 
+	mem_hotplug_begin();
 	if (nr_vmemmap_pages) {
 		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
 		if (ret)
-			return ret;
+			goto out;
 	}
 
 	ret = online_pages(start_pfn + nr_vmemmap_pages,
@@ -215,7 +216,7 @@ static int memory_block_online(struct memory_block *mem)
 	if (ret) {
 		if (nr_vmemmap_pages)
 			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
-		return ret;
+		goto out;
 	}
 
 	/*
@@ -227,6 +228,8 @@ static int memory_block_online(struct memory_block *mem)
 					  nr_vmemmap_pages);
 
 	mem->zone = zone;
+out:
+	mem_hotplug_done();
 	return ret;
 }
 
@@ -247,6 +250,7 @@ static int memory_block_offline(struct memory_block *mem)
 	if (mem->altmap)
 		nr_vmemmap_pages = mem->altmap->free;
 
+	mem_hotplug_begin();
 	if (nr_vmemmap_pages)
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  -nr_vmemmap_pages);
@@ -258,13 +262,15 @@ static int memory_block_offline(struct memory_block *mem)
 		if (nr_vmemmap_pages)
 			adjust_present_page_count(pfn_to_page(start_pfn),
 						  mem->group, nr_vmemmap_pages);
-		return ret;
+		goto out;
 	}
 
 	if (nr_vmemmap_pages)
 		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
 
 	mem->zone = NULL;
+out:
+	mem_hotplug_done();
 	return ret;
 }
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1b03f4ec6fd2..c8238fc5edcb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1129,6 +1129,9 @@ void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages)
 	kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
 }
 
+/*
+ * Must be called with mem_hotplug_lock in write mode.
+ */
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		       struct zone *zone, struct memory_group *group)
 {
@@ -1149,7 +1152,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
-	mem_hotplug_begin();
 
 	/* associate pfn range with the zone */
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
@@ -1208,7 +1210,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	writeback_set_ratelimit();
 
 	memory_notify(MEM_ONLINE, &arg);
-	mem_hotplug_done();
 	return 0;
 
 failed_addition:
@@ -1217,7 +1218,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		 (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
 	memory_notify(MEM_CANCEL_ONLINE, &arg);
 	remove_pfn_range_from_zone(zone, pfn, nr_pages);
-	mem_hotplug_done();
 	return ret;
 }
 
@@ -1863,6 +1863,9 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }
 
+/*
+ * Must be called with mem_hotplug_lock in write mode.
+ */
 int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 			struct zone *zone, struct memory_group *group)
 {
@@ -1885,8 +1888,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
-	mem_hotplug_begin();
-
 	/*
 	 * Don't allow to offline memory blocks that contain holes.
 	 * Consequently, memory blocks with holes can never get onlined
@@ -2027,7 +2028,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 
 	memory_notify(MEM_OFFLINE, &arg);
 	remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
-	mem_hotplug_done();
 	return 0;
 
 failed_removal_isolated:
@@ -2042,7 +2042,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 		 (unsigned long long) start_pfn << PAGE_SHIFT,
 		 ((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
 		 reason);
-	mem_hotplug_done();
 	return ret;
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/8] mm/memory_hotplug: fix error handling in add_memory_resource()
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
  2023-11-14 18:02 ` [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-14 18:36   ` David Hildenbrand
  2023-11-14 18:02 ` [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE Sumanth Korikkar
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

In add_memory_resource(), creation of memory block devices occurs after
a successful call to arch_add_memory(). However, creation of memory
block devices could fail. In that case, arch_remove_memory() is called
to perform the necessary cleanup.

Currently, with or without altmap support, arch_remove_memory() is
always called with altmap set to NULL during error handling. This leads
to freeing of struct pages using free_pages(), even though the
allocation might have been performed with altmap support via
altmap_alloc_block_buf().

Fix the error handling by passing the altmap to arch_remove_memory().
This ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
  via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().

Fixes: db051a0dac13 ("mm/memory_hotplug: create memory block devices after arch_add_memory()")
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c8238fc5edcb..4f476a970e84 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1458,7 +1458,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	/* create memory block devices after memory was added */
 	ret = create_memory_block_devices(start, size, params.altmap, group);
 	if (ret) {
-		arch_remove_memory(start, size, NULL);
+		arch_remove_memory(start, size, params.altmap);
 		goto error_free;
 	}
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
  2023-11-14 18:02 ` [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order Sumanth Korikkar
  2023-11-14 18:02 ` [PATCH 2/8] mm/memory_hotplug: fix error handling in add_memory_resource() Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-16 18:43   ` David Hildenbrand
  2023-11-17 21:39   ` kernel test robot
  2023-11-14 18:02 ` [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers Sumanth Korikkar
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

vmem_altmap_free() and vmem_altmap_offset() can be utilized without
CONFIG_ZONE_DEVICE enabled. Hence, move them to sparse-vmemmap.c.

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 include/linux/memremap.h | 12 ------------
 include/linux/mm.h       |  2 ++
 mm/memremap.c            | 14 +-------------
 mm/sparse-vmemmap.c      | 13 +++++++++++++
 4 files changed, 16 insertions(+), 25 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1314d9c5f05b..744c830f4b13 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -196,8 +196,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
-unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
-void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
 unsigned long memremap_compat_align(void);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
@@ -228,16 +226,6 @@ static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
 	return false;
 }
 
-static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
-{
-	return 0;
-}
-
-static inline void vmem_altmap_free(struct vmem_altmap *altmap,
-		unsigned long nr_pfns)
-{
-}
-
 /* when memremap_pages() is disabled all archs can remap a single page */
 static inline unsigned long memremap_compat_align(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..5edb0dfd2d01 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3765,6 +3765,8 @@ pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, struct page *reuse);
+unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
+void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index bee85560a243..9531faa92a7c 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -7,6 +7,7 @@
 #include <linux/memremap.h>
 #include <linux/pfn_t.h>
 #include <linux/swap.h>
+#include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/swapops.h>
 #include <linux/types.h>
@@ -422,19 +423,6 @@ void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap)
 }
 EXPORT_SYMBOL_GPL(devm_memunmap_pages);
 
-unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
-{
-	/* number of pfns from base where pfn_to_page() is valid */
-	if (altmap)
-		return altmap->reserve + altmap->free;
-	return 0;
-}
-
-void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
-{
-	altmap->alloc -= nr_pfns;
-}
-
 /**
  * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
  * @pfn: page frame number to lookup page_map
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a2cbe44c48e1..bd1b9a137f93 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -46,6 +46,19 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 					       MEMBLOCK_ALLOC_ACCESSIBLE, node);
 }
 
+unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
+{
+	/* number of pfns from base where pfn_to_page() is valid */
+	if (altmap)
+		return altmap->reserve + altmap->free;
+	return 0;
+}
+
+void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
+{
+	altmap->alloc -= nr_pfns;
+}
+
 void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 {
 	/* If the main allocator is up use that, fallback to bootmem. */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
                   ` (2 preceding siblings ...)
  2023-11-14 18:02 ` [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-14 18:27   ` David Hildenbrand
  2023-11-14 18:02 ` [PATCH 5/8] s390/mm: allocate vmemmap pages from self-contained memory range Sumanth Korikkar
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

Add new memory notifiers to mimic the dynamic ACPI event-triggered
logic for memory hotplug on platforms that do not generate such events.
This will be used to implement the "memmap on memory" feature for s390
in a later patch.

Platforms such as x86 can support physical memory hotplug via ACPI.
When physical memory is hotplugged, the ACPI event leads to the memory
addition with the following callchain:
acpi_memory_device_add()
  -> acpi_memory_enable_device()
     -> __add_memory()

After this, the hotplugged memory is physically accessible, and altmap
support is prepared, before the "memmap on memory" initialization in
memory_block_online() is called.

On s390, memory hotplug works in a different way. The available hotplug
memory has to be defined upfront in the hypervisor, but it is made
physically accessible only when the user sets it online via sysfs,
currently in the MEM_GOING_ONLINE notifier. This requires calling
add_memory() during early memory detection, in order to get the sysfs
representation, but we cannot use "memmap on memory" altmap support at
this stage, w/o having it physically accessible.

Since no ACPI or similar events are generated, there is no way to set up
altmap support, or even make the memory physically accessible at all,
before the "memmap on memory" initialization in memory_block_online().

The new MEM_PHYS_ONLINE notifier works around this by providing a hook
to make the memory physically accessible, and also to call
__add_pages() with altmap support, early in memory_block_online().
Similarly, the MEM_PHYS_OFFLINE notifier allows making the memory
inaccessible and calling __remove_pages(), at the end of
memory_block_offline().

Calling __add/remove_pages() requires mem_hotplug_lock, so move
mem_hotplug_begin/done() to include the new notifiers.

All architectures ignore unknown memory notifiers, so this patch should
not introduce any functional changes.

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 drivers/base/memory.c  | 18 +++++++++++++++++-
 include/linux/memory.h |  2 ++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 1e9f6a1749b9..604940f62246 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -185,6 +185,7 @@ static int memory_block_online(struct memory_block *mem)
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
 	unsigned long nr_vmemmap_pages = 0;
+	struct memory_notify arg;
 	struct zone *zone;
 	int ret;
 
@@ -194,6 +195,14 @@ static int memory_block_online(struct memory_block *mem)
 	zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
 				  start_pfn, nr_pages);
 
+	arg.start_pfn = start_pfn;
+	arg.nr_pages = nr_pages;
+	mem_hotplug_begin();
+	ret = memory_notify(MEM_PHYS_ONLINE, &arg);
+	ret = notifier_to_errno(ret);
+	if (ret)
+		goto out_notifier;
+
 	/*
 	 * Although vmemmap pages have a different lifecycle than the pages
 	 * they describe (they remain until the memory is unplugged), doing
@@ -204,7 +213,6 @@ static int memory_block_online(struct memory_block *mem)
 	if (mem->altmap)
 		nr_vmemmap_pages = mem->altmap->free;
 
-	mem_hotplug_begin();
 	if (nr_vmemmap_pages) {
 		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
 		if (ret)
@@ -228,7 +236,11 @@ static int memory_block_online(struct memory_block *mem)
 					  nr_vmemmap_pages);
 
 	mem->zone = zone;
+	mem_hotplug_done();
+	return ret;
 out:
+	memory_notify(MEM_PHYS_OFFLINE, &arg);
+out_notifier:
 	mem_hotplug_done();
 	return ret;
 }
@@ -238,6 +250,7 @@ static int memory_block_offline(struct memory_block *mem)
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
 	unsigned long nr_vmemmap_pages = 0;
+	struct memory_notify arg;
 	int ret;
 
 	if (!mem->zone)
@@ -269,6 +282,9 @@ static int memory_block_offline(struct memory_block *mem)
 		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
 
 	mem->zone = NULL;
+	arg.start_pfn = start_pfn;
+	arg.nr_pages = nr_pages;
+	memory_notify(MEM_PHYS_OFFLINE, &arg);
 out:
 	mem_hotplug_done();
 	return ret;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index f53cfdaaaa41..5d8b962b8fa1 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -96,6 +96,8 @@ int set_memory_block_size_order(unsigned int order);
 #define	MEM_GOING_ONLINE	(1<<3)
 #define	MEM_CANCEL_ONLINE	(1<<4)
 #define	MEM_CANCEL_OFFLINE	(1<<5)
+#define	MEM_PHYS_ONLINE		(1<<6)
+#define	MEM_PHYS_OFFLINE	(1<<7)
 
 struct memory_notify {
 	unsigned long start_pfn;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 5/8] s390/mm: allocate vmemmap pages from self-contained memory range
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
                   ` (3 preceding siblings ...)
  2023-11-14 18:02 ` [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-14 18:02 ` [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers Sumanth Korikkar
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

Allocate the memory map (struct pages array) from the hotplugged memory
range, rather than from system memory. This addresses the issue where
standby memory, when configured to be much larger than online memory,
could lead to IPL failure, because the memory map is allocated from
online memory. For example, 16MB of memory map is needed for a memory
block size of 1GB; with much more standby than online memory, these
allocations can exhaust online memory.
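
(For reference, the 16MB figure follows from a 4KB page size and the
usual 64-byte struct page on 64-bit kernels:

  1GB / 4KB          = 262144 pages to describe
  262144 * 64 bytes  = 16MB of memory map)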

To address this issue, introduce "memmap on memory" using the
vmem_altmap structure on s390. Architectures that want to implement it
should pass the altmap to the vmemmap_populate() function and its
associated callchain. This approach is described in commit 4b94ffdc4163
("x86, mm: introduce vmem_altmap to augment vmemmap_populate()").

Provide "memmap on memory" support for s390 by passing the altmap in
vmemmap_populate() and its callchain. The allocation path is described
as follows:
* When altmap is NULL in vmemmap_populate(), memory map allocation
  occurs using the existing vmemmap_alloc_block_buf().
* When altmap is not NULL in vmemmap_populate(), memory map allocation
  still uses vmemmap_alloc_block_buf(), but this function internally
  calls altmap_alloc_block_buf().

For deallocation, the process is outlined as follows:
* When altmap is NULL in vmemmap_free(), memory map deallocation happens
  through free_pages().
* When altmap is not NULL in vmemmap_free(), memory map deallocation
  occurs via vmem_altmap_free().

While the memory map allocation is primarily handled through the
self-contained memory map range, a small amount of system memory may
still be required for the vmemmap page tables. To mitigate this impact,
the feature is limited to machines with EDAT1 support.

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 arch/s390/mm/init.c |  3 ---
 arch/s390/mm/vmem.c | 62 +++++++++++++++++++++++++--------------------
 2 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 43e612bc2bcd..8d9a60ccb777 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -281,9 +281,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	unsigned long size_pages = PFN_DOWN(size);
 	int rc;
 
-	if (WARN_ON_ONCE(params->altmap))
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
 		return -EINVAL;
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 186a020857cf..eb100479f7be 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -33,8 +33,12 @@ static void __ref *vmem_alloc_pages(unsigned int order)
 	return memblock_alloc(size, size);
 }
 
-static void vmem_free_pages(unsigned long addr, int order)
+static void vmem_free_pages(unsigned long addr, int order, struct vmem_altmap *altmap)
 {
+	if (altmap) {
+		vmem_altmap_free(altmap, 1 << order);
+		return;
+	}
 	/* We don't expect boot memory to be removed ever. */
 	if (!slab_is_available() ||
 	    WARN_ON_ONCE(PageReserved(virt_to_page((void *)addr))))
@@ -156,7 +160,8 @@ static bool vmemmap_unuse_sub_pmd(unsigned long start, unsigned long end)
 
 /* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
 static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
-				  unsigned long end, bool add, bool direct)
+				  unsigned long end, bool add, bool direct,
+				  struct vmem_altmap *altmap)
 {
 	unsigned long prot, pages = 0;
 	int ret = -ENOMEM;
@@ -172,11 +177,11 @@ static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
 			if (pte_none(*pte))
 				continue;
 			if (!direct)
-				vmem_free_pages((unsigned long) pfn_to_virt(pte_pfn(*pte)), 0);
+				vmem_free_pages((unsigned long)pfn_to_virt(pte_pfn(*pte)), get_order(PAGE_SIZE), altmap);
 			pte_clear(&init_mm, addr, pte);
 		} else if (pte_none(*pte)) {
 			if (!direct) {
-				void *new_page = vmemmap_alloc_block(PAGE_SIZE, NUMA_NO_NODE);
+				void *new_page = vmemmap_alloc_block_buf(PAGE_SIZE, NUMA_NO_NODE, altmap);
 
 				if (!new_page)
 					goto out;
@@ -213,7 +218,8 @@ static void try_free_pte_table(pmd_t *pmd, unsigned long start)
 
 /* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
 static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
-				  unsigned long end, bool add, bool direct)
+				  unsigned long end, bool add, bool direct,
+				  struct vmem_altmap *altmap)
 {
 	unsigned long next, prot, pages = 0;
 	int ret = -ENOMEM;
@@ -234,11 +240,11 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 				if (IS_ALIGNED(addr, PMD_SIZE) &&
 				    IS_ALIGNED(next, PMD_SIZE)) {
 					if (!direct)
-						vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE));
+						vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
 					pmd_clear(pmd);
 					pages++;
 				} else if (!direct && vmemmap_unuse_sub_pmd(addr, next)) {
-					vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE));
+					vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
 					pmd_clear(pmd);
 				}
 				continue;
@@ -261,7 +267,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 				 * page tables since vmemmap_populate gets
 				 * called for each section separately.
 				 */
-				new_page = vmemmap_alloc_block(PMD_SIZE, NUMA_NO_NODE);
+				new_page = vmemmap_alloc_block_buf(PMD_SIZE, NUMA_NO_NODE, altmap);
 				if (new_page) {
 					set_pmd(pmd, __pmd(__pa(new_page) | prot));
 					if (!IS_ALIGNED(addr, PMD_SIZE) ||
@@ -280,7 +286,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 				vmemmap_use_sub_pmd(addr, next);
 			continue;
 		}
-		ret = modify_pte_table(pmd, addr, next, add, direct);
+		ret = modify_pte_table(pmd, addr, next, add, direct, altmap);
 		if (ret)
 			goto out;
 		if (!add)
@@ -302,12 +308,12 @@ static void try_free_pmd_table(pud_t *pud, unsigned long start)
 	for (i = 0; i < PTRS_PER_PMD; i++, pmd++)
 		if (!pmd_none(*pmd))
 			return;
-	vmem_free_pages(pud_deref(*pud), CRST_ALLOC_ORDER);
+	vmem_free_pages(pud_deref(*pud), CRST_ALLOC_ORDER, NULL);
 	pud_clear(pud);
 }
 
 static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
-			    bool add, bool direct)
+			    bool add, bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long next, prot, pages = 0;
 	int ret = -ENOMEM;
@@ -347,7 +353,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		} else if (pud_large(*pud)) {
 			continue;
 		}
-		ret = modify_pmd_table(pud, addr, next, add, direct);
+		ret = modify_pmd_table(pud, addr, next, add, direct, altmap);
 		if (ret)
 			goto out;
 		if (!add)
@@ -370,12 +376,12 @@ static void try_free_pud_table(p4d_t *p4d, unsigned long start)
 		if (!pud_none(*pud))
 			return;
 	}
-	vmem_free_pages(p4d_deref(*p4d), CRST_ALLOC_ORDER);
+	vmem_free_pages(p4d_deref(*p4d), CRST_ALLOC_ORDER, NULL);
 	p4d_clear(p4d);
 }
 
 static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
-			    bool add, bool direct)
+			    bool add, bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long next;
 	int ret = -ENOMEM;
@@ -394,7 +400,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 				goto out;
 			p4d_populate(&init_mm, p4d, pud);
 		}
-		ret = modify_pud_table(p4d, addr, next, add, direct);
+		ret = modify_pud_table(p4d, addr, next, add, direct, altmap);
 		if (ret)
 			goto out;
 		if (!add)
@@ -415,12 +421,12 @@ static void try_free_p4d_table(pgd_t *pgd, unsigned long start)
 		if (!p4d_none(*p4d))
 			return;
 	}
-	vmem_free_pages(pgd_deref(*pgd), CRST_ALLOC_ORDER);
+	vmem_free_pages(pgd_deref(*pgd), CRST_ALLOC_ORDER, NULL);
 	pgd_clear(pgd);
 }
 
 static int modify_pagetable(unsigned long start, unsigned long end, bool add,
-			    bool direct)
+			    bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long addr, next;
 	int ret = -ENOMEM;
@@ -445,7 +451,7 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 				goto out;
 			pgd_populate(&init_mm, pgd, p4d);
 		}
-		ret = modify_p4d_table(pgd, addr, next, add, direct);
+		ret = modify_p4d_table(pgd, addr, next, add, direct, altmap);
 		if (ret)
 			goto out;
 		if (!add)
@@ -458,14 +464,16 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	return ret;
 }
 
-static int add_pagetable(unsigned long start, unsigned long end, bool direct)
+static int add_pagetable(unsigned long start, unsigned long end, bool direct,
+			 struct vmem_altmap *altmap)
 {
-	return modify_pagetable(start, end, true, direct);
+	return modify_pagetable(start, end, true, direct, altmap);
 }
 
-static int remove_pagetable(unsigned long start, unsigned long end, bool direct)
+static int remove_pagetable(unsigned long start, unsigned long end, bool direct,
+			    struct vmem_altmap *altmap)
 {
-	return modify_pagetable(start, end, false, direct);
+	return modify_pagetable(start, end, false, direct, altmap);
 }
 
 /*
@@ -474,7 +482,7 @@ static int remove_pagetable(unsigned long start, unsigned long end, bool direct)
 static int vmem_add_range(unsigned long start, unsigned long size)
 {
 	start = (unsigned long)__va(start);
-	return add_pagetable(start, start + size, true);
+	return add_pagetable(start, start + size, true, NULL);
 }
 
 /*
@@ -483,7 +491,7 @@ static int vmem_add_range(unsigned long start, unsigned long size)
 static void vmem_remove_range(unsigned long start, unsigned long size)
 {
 	start = (unsigned long)__va(start);
-	remove_pagetable(start, start + size, true);
+	remove_pagetable(start, start + size, true, NULL);
 }
 
 /*
@@ -496,9 +504,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 
 	mutex_lock(&vmem_mutex);
 	/* We don't care about the node, just use NUMA_NO_NODE on allocations */
-	ret = add_pagetable(start, end, false);
+	ret = add_pagetable(start, end, false, altmap);
 	if (ret)
-		remove_pagetable(start, end, false);
+		remove_pagetable(start, end, false, altmap);
 	mutex_unlock(&vmem_mutex);
 	return ret;
 }
@@ -509,7 +517,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 		  struct vmem_altmap *altmap)
 {
 	mutex_lock(&vmem_mutex);
-	remove_pagetable(start, end, false);
+	remove_pagetable(start, end, false, altmap);
 	mutex_unlock(&vmem_mutex);
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
                   ` (4 preceding siblings ...)
  2023-11-14 18:02 ` [PATCH 5/8] s390/mm: allocate vmemmap pages from self-contained memory range Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-14 18:39   ` David Hildenbrand
  2023-11-14 18:02 ` [PATCH 7/8] s390/sclp: remove unhandled memory notifier type Sumanth Korikkar
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

Implement the MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on
s390.

Implementation of MEM_PHYS_ONLINE Memory Notifier:
* Transition the memory block to an accessible/online state using the
  sclp assign command.
* Execute __add_pages() for the memory block, enabling a self-contained
  memory map range. For boot-time memory, vmemmap mapping is carried out
  through sparse_init().

Implementation of MEM_PHYS_OFFLINE Memory Notifier:
* Execute __remove_pages() exclusively for the memory block (applicable
  where a self-contained memory map was possible before).
* Shift the memory block to an inaccessible/offline state using the sclp
  unassign command.

Additional Implementation Considerations:
* When MHP_MEMMAP_ON_MEMORY is disabled, the system retains the old
  behavior. This means the memory map is allocated from default memory,
  and struct vmemmap pages are populated during the standby memory
  detection phase.
* With MHP_MEMMAP_ON_MEMORY enabled (allowing self-contained memory
  map), the memory map is allocated using the self-contained memory map
  range. Struct vmemmap pages are populated during the memory hotplug
  phase.
* If MACHINE_HAS_EDAT1 is unavailable, MHP_MEMMAP_ON_MEMORY is
  automatically disabled. This ensures that vmemmap pagetables do not
  consume additional memory from the default memory allocator.
* The MEM_GOING_ONLINE notifier has been modified to perform no
  operation, as MEM_PHYS_ONLINE already executes the sclp assign
  command.
* The MEM_CANCEL_ONLINE notifier now performs no operation, as
  MEM_PHYS_OFFLINE already executes the sclp unassign command.
* The call to __add_pages() in arch_add_memory() with altmap support is
  skipped. This operation is deferred and will be performed later in the
  MEM_PHYS_ONLINE notifier.

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 arch/s390/mm/init.c          | 16 +++++++++++++++-
 drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 8d9a60ccb777..db505ed590b2 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
+	/*
+	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
+	 * onlining phase
+	 */
+	if (params->altmap)
+		return 0;
 
 	rc = __add_pages(nid, start_pfn, size_pages, params);
 	if (rc)
@@ -300,7 +306,15 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	__remove_pages(start_pfn, nr_pages, altmap);
+	/*
+	 * On s390, currently arch_remove_memory() will be called during error
+	 * handling of add_memory_resource(). When MHP_MEMMAP_ON_MEMORY is
+	 * enabled, __add_pages() is performed later during the memory onlining
+	 * phase.  Hence, __remove_pages() should not be called here in that
+	 * case, but only later during memory offline phase
+	 */
+	if (!altmap)
+		__remove_pages(start_pfn, nr_pages, NULL);
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 11c428f4c7cf..12f3d4af7e4e 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -18,6 +18,7 @@
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/memory.h>
+#include <linux/memory_hotplug.h>
 #include <linux/module.h>
 #include <asm/ctlreg.h>
 #include <asm/chpid.h>
@@ -26,6 +27,7 @@
 #include <asm/sclp.h>
 #include <asm/numa.h>
 #include <asm/facility.h>
+#include <asm/page-states.h>
 
 #include "sclp.h"
 
@@ -319,6 +321,8 @@ static bool contains_standby_increment(unsigned long start, unsigned long end)
 static int sclp_mem_notifier(struct notifier_block *nb,
 			     unsigned long action, void *data)
 {
+	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	struct memory_block *memory_block;
 	unsigned long start, size;
 	struct memory_notify *arg;
 	unsigned char id;
@@ -330,6 +334,11 @@ static int sclp_mem_notifier(struct notifier_block *nb,
 	mutex_lock(&sclp_mem_mutex);
 	for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
 		sclp_attach_storage(id);
+	memory_block = find_memory_block(pfn_to_section_nr(arg->start_pfn));
+	if (!memory_block) {
+		rc = -EINVAL;
+		goto out;
+	}
 	switch (action) {
 	case MEM_GOING_OFFLINE:
 		/*
@@ -344,17 +353,34 @@ static int sclp_mem_notifier(struct notifier_block *nb,
 	case MEM_CANCEL_OFFLINE:
 		break;
 	case MEM_GOING_ONLINE:
+		break;
+	case MEM_PHYS_ONLINE:
 		rc = sclp_mem_change_state(start, size, 1);
+		if (rc || !memory_block->altmap)
+			goto out;
+		params.altmap = memory_block->altmap;
+		rc = __add_pages(0, arg->start_pfn, arg->nr_pages, &params);
+		if (rc)
+			sclp_mem_change_state(start, size, 0);
+		/*
+		 * Set CMMA state to nodat here, since the struct page memory
+		 * at the beginning of the memory block will not go through the
+		 * buddy allocator later.
+		 */
+		__arch_set_page_nodat((void *)start, memory_block->altmap->free);
 		break;
 	case MEM_CANCEL_ONLINE:
-		sclp_mem_change_state(start, size, 0);
-		break;
 	case MEM_OFFLINE:
+		break;
+	case MEM_PHYS_OFFLINE:
+		if (memory_block->altmap)
+			__remove_pages(arg->start_pfn, arg->nr_pages, memory_block->altmap);
 		sclp_mem_change_state(start, size, 0);
 		break;
 	default:
 		break;
 	}
+out:
 	mutex_unlock(&sclp_mem_mutex);
 	return rc ? NOTIFY_BAD : NOTIFY_OK;
 }
@@ -400,7 +426,8 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(0, addr, block_size, MHP_NONE);
+		add_memory(0, addr, block_size,
+			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
 skip_add:
 	first_rn = rn;
 	num = 1;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 7/8] s390/sclp: remove unhandled memory notifier type
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
                   ` (5 preceding siblings ...)
  2023-11-14 18:02 ` [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-16 19:33   ` David Hildenbrand
  2023-11-14 18:02 ` [PATCH 8/8] s390: enable MHP_MEMMAP_ON_MEMORY Sumanth Korikkar
  2023-11-16 23:08 ` [PATCH 0/8] implement "memmap on memory" feature on s390 David Hildenbrand
  8 siblings, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

Remove memory notifier types which are unhandled by s390. Unhandled
memory notifier types are covered by the default case.

Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 drivers/s390/char/sclp_cmd.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 12f3d4af7e4e..428f8a583e8f 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -349,11 +349,6 @@ static int sclp_mem_notifier(struct notifier_block *nb,
 		if (contains_standby_increment(start, start + size))
 			rc = -EPERM;
 		break;
-	case MEM_ONLINE:
-	case MEM_CANCEL_OFFLINE:
-		break;
-	case MEM_GOING_ONLINE:
-		break;
 	case MEM_PHYS_ONLINE:
 		rc = sclp_mem_change_state(start, size, 1);
 		if (rc || !memory_block->altmap)
@@ -369,9 +364,6 @@ static int sclp_mem_notifier(struct notifier_block *nb,
 		 */
 		__arch_set_page_nodat((void *)start, memory_block->altmap->free);
 		break;
-	case MEM_CANCEL_ONLINE:
-	case MEM_OFFLINE:
-		break;
 	case MEM_PHYS_OFFLINE:
 		if (memory_block->altmap)
 			__remove_pages(arg->start_pfn, arg->nr_pages, memory_block->altmap);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 8/8] s390: enable MHP_MEMMAP_ON_MEMORY
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
                   ` (6 preceding siblings ...)
  2023-11-14 18:02 ` [PATCH 7/8] s390/sclp: remove unhandled memory notifier type Sumanth Korikkar
@ 2023-11-14 18:02 ` Sumanth Korikkar
  2023-11-16 23:08 ` [PATCH 0/8] implement "memmap on memory" feature on s390 David Hildenbrand
  8 siblings, 0 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-14 18:02 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

Enable MHP_MEMMAP_ON_MEMORY to support "memmap on memory". The
memory_hotplug.memmap_on_memory=true kernel parameter must be set on
the kernel command line to enable the feature.
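
A usage sketch (the path below is the generic memory_hotplug module
parameter, visible under /sys/module when the feature is compiled in):

  # on the kernel command line
  memory_hotplug.memmap_on_memory=true

  # verify after boot
  $ cat /sys/module/memory_hotplug/parameters/memmap_on_memory
  Y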

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3bec98d20283..4b9b0f947ddb 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
 	select ARCH_INLINE_WRITE_UNLOCK_BH
 	select ARCH_INLINE_WRITE_UNLOCK_IRQ
 	select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
+	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_STACKWALK
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order
  2023-11-14 18:02 ` [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order Sumanth Korikkar
@ 2023-11-14 18:22   ` David Hildenbrand
  2023-11-15 13:41     ` Sumanth Korikkar
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2023-11-14 18:22 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

On 14.11.23 19:02, Sumanth Korikkar wrote:

The patch subject talks about "fixing locking order", but it's actually 
missing locking, no?

>  From Documentation/core-api/memory-hotplug.rst:
> When adding/removing/onlining/offlining memory or adding/removing
> heterogeneous/device memory, we should always hold the mem_hotplug_lock
> in write mode to serialise memory hotplug (e.g. access to global/zone
> variables).
> 
> mhp_(de)init_memmap_on_memory() functions can change zone stats and
> struct page content, but they are currently called w/o the
> mem_hotplug_lock.
> 
> When memory block is being offlined and when kmemleak goes through each
> populated zone, the following theoretical race conditions could occur:
> CPU 0:					     | CPU 1:
> memory_offline()			     |
> -> offline_pages()			     |
> 	-> mem_hotplug_begin()		     |
> 	   ...				     |
> 	-> mem_hotplug_done()		     |
> 					     | kmemleak_scan()
> 					     | -> get_online_mems()
> 					     |    ...
> -> mhp_deinit_memmap_on_memory()	     |
>    [not protected by mem_hotplug_begin/done()]|
>    Marks memory section as offline,	     |   Retrieves zone_start_pfn
>    poisons vmemmap struct pages and updates   |   and struct page members.
>    the zone related data			     |
>     					     |    ...
>     					     | -> put_online_mems()
> 
> Fix this by ensuring mem_hotplug_lock is taken before performing
> mhp_init_memmap_on_memory(). Also ensure that
> mhp_deinit_memmap_on_memory() holds the lock.

What speaks against grabbing that lock in these functions?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  2023-11-14 18:02 ` [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers Sumanth Korikkar
@ 2023-11-14 18:27   ` David Hildenbrand
  2023-11-15 14:23     ` Sumanth Korikkar
  2023-11-15 15:03     ` Gerald Schaefer
  0 siblings, 2 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-14 18:27 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

On 14.11.23 19:02, Sumanth Korikkar wrote:
> Add new memory notifiers to mimic the dynamic ACPI event triggered logic
> for memory hotplug on platforms that do not generate such events. This
> will be used to implement "memmap on memory" feature for s390 in a later
> patch.
> 
> Platforms such as x86 can support physical memory hotplug via ACPI. When
> there is physical memory hotplug, ACPI event leads to the memory
> addition with the following callchain:
> acpi_memory_device_add()
>    -> acpi_memory_enable_device()
>       -> __add_memory()
> 
> After this, the hotplugged memory is physically accessible, and altmap
> support prepared, before the "memmap on memory" initialization in
> memory_block_online() is called.
> 
> On s390, memory hotplug works in a different way. The available hotplug
> memory has to be defined upfront in the hypervisor, but it is made
> physically accessible only when the user sets it online via sysfs,
> currently in the MEM_GOING_ONLINE notifier. This requires calling
> add_memory() during early memory detection, in order to get the sysfs
> representation, but we cannot use "memmap on memory" altmap support at
> this stage, w/o having it physically accessible.
> 
> Since no ACPI or similar events are generated, there is no way to set up
> altmap support, or even make the memory physically accessible at all,
> before the "memmap on memory" initialization in memory_block_online().
> 
> The new MEM_PHYS_ONLINE notifier allows to work around this, by
> providing a hook to make the memory physically accessible, and also call
> __add_pages() with altmap support, early in memory_block_online().
> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
> inaccessible and call __remove_pages(), at the end of
> memory_block_offline().
> 
> Calling __add/remove_pages() requires mem_hotplug_lock, so move
> mem_hotplug_begin/done() to include the new notifiers.
> 
> All architectures ignore unknown memory notifiers, so this patch should
> not introduce any functional changes.

Sorry to say, no. No hacks please, and this is a hack for memory that 
has already been added to the system.

If you want memory that was added without an altmap to suddenly have an 
altmap, then look into removing and re-adding that memory, or some way 
to convert offline memory.

But certainly not on the online/offline path triggered by sysfs.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/8] mm/memory_hotplug: fix error handling in add_memory_resource()
  2023-11-14 18:02 ` [PATCH 2/8] mm/memory_hotplug: fix error handling in add_memory_resource() Sumanth Korikkar
@ 2023-11-14 18:36   ` David Hildenbrand
  2023-11-15 13:45     ` Sumanth Korikkar
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2023-11-14 18:36 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML, Vishal Verma

On 14.11.23 19:02, Sumanth Korikkar wrote:
> In add_memory_resource(), creation of memory block devices occurs after
> successful call to arch_add_memory(). However, creation of memory block
> devices could fail.  In that case, arch_remove_memory() is called to
> perform necessary cleanup.
> 
> Currently with or without altmap support, arch_remove_memory() is always
> passed with altmap set to NULL during error handling. This leads to
> freeing of struct pages using free_pages(), even though the allocation
> might have been performed with altmap support via
> altmap_alloc_block_buf().
> 
> Fix the error handling by passing altmap in arch_remove_memory(). This
> ensures the following:
> * When altmap is disabled, deallocation of the struct pages array occurs
>    via free_pages().
> * When altmap is enabled, deallocation occurs via vmem_altmap_free().
> 
> Fixes: db051a0dac13 ("mm/memory_hotplug: create memory block devices after arch_add_memory()")

That's the wrong commit. We didn't support memmap-on-memory back then.

Likely it should be:

Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added 
memory range")

> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
> ---
>   mm/memory_hotplug.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c8238fc5edcb..4f476a970e84 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1458,7 +1458,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>   	/* create memory block devices after memory was added */
>   	ret = create_memory_block_devices(start, size, params.altmap, group);
>   	if (ret) {
> -		arch_remove_memory(start, size, NULL);
> +		arch_remove_memory(start, size, params.altmap);
>   		goto error_free;
>   	}
>   

Indeed; this will conflict with Vishals patches, ccing him.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  2023-11-14 18:02 ` [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers Sumanth Korikkar
@ 2023-11-14 18:39   ` David Hildenbrand
  2023-11-15 14:20     ` Sumanth Korikkar
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2023-11-14 18:39 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

On 14.11.23 19:02, Sumanth Korikkar wrote:
> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> 
> Implementation of MEM_PHYS_ONLINE Memory Notifier:
> * Transition the memory block to an accessible/online state using the
>    sclp assign command.
> * Execute __add_pages() for the memory block, enabling a self-contained
>    memory map range. For boot-time memory, vmemmap mapping is carried out
>    through sparse_init().
> 
> Implementation of MEM_PHYS_OFFLINE Memory Notifier:
> * Execute __remove_pages() exclusively for the memory block (applicable
>    where a self-contained memory map was possible before).
> * Shift the memory block to an inaccessible/offline state using the sclp
>    unassign command.
> 
> Additional Implementation Considerations:
> * When MHP_MEMMAP_ON_MEMORY is disabled, the system retains the old
>    behavior. This means the memory map is allocated from default memory,
>    and struct vmemmap pages are populated during the standby memory
>    detection phase.
> * With MHP_MEMMAP_ON_MEMORY enabled (allowing self-contained memory
>    map), the memory map is allocated using the self-contained memory map
>    range. Struct vmemmap pages are populated during the memory hotplug
>    phase.
> * If MACHINE_HAS_EDAT1 is unavailable, MHP_MEMMAP_ON_MEMORY is
>    automatically disabled. This ensures that vmemmap pagetables do not
>    consume additional memory from the default memory allocator.
> * The MEM_GOING_ONLINE notifier has been modified to perform no
>    operation, as MEM_PHYS_ONLINE already executes the sclp assign
>    command.
> * The MEM_CANCEL_ONLINE notifier now performs no operation, as
>    MEM_PHYS_OFFLINE already executes the sclp unassign command.
> * The call to __add_pages() in arch_add_memory() with altmap support is
>    skipped. This operation is deferred and will be performed later in the
>    MEM_PHYS_ONLINE notifier.
> 
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
> ---
>   arch/s390/mm/init.c          | 16 +++++++++++++++-
>   drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
>   2 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 8d9a60ccb777..db505ed590b2 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   	rc = vmem_add_mapping(start, size);
>   	if (rc)
>   		return rc;
> +	/*
> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> +	 * onlining phase
> +	 */
> +	if (params->altmap)
> +		return 0;


So we'd have added memory blocks without a memmap? Sorry, but this seems 
to further hack into the s390x direction.

Maybe s390x should just provide a dedicated interface to add these memory 
blocks instead of adding them during boot and then relying on the old 
way of using online/offline to set them online/offline.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order
  2023-11-14 18:22   ` David Hildenbrand
@ 2023-11-15 13:41     ` Sumanth Korikkar
  2023-11-16 18:40       ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-15 13:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Tue, Nov 14, 2023 at 07:22:33PM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> 
> The patch subject talks about "fixing locking order", but it's actually
> missing locking, no?
> 
> >  From Documentation/core-api/memory-hotplug.rst:
> > When adding/removing/onlining/offlining memory or adding/removing
> > heterogeneous/device memory, we should always hold the mem_hotplug_lock
> > in write mode to serialise memory hotplug (e.g. access to global/zone
> > variables).
> > 
> > mhp_(de)init_memmap_on_memory() functions can change zone stats and
> > struct page content, but they are currently called w/o the
> > mem_hotplug_lock.
> > 
> > When memory block is being offlined and when kmemleak goes through each
> > populated zone, the following theoretical race conditions could occur:
> > CPU 0:					     | CPU 1:
> > memory_offline()			     |
> > -> offline_pages()			     |
> > 	-> mem_hotplug_begin()		     |
> > 	   ...				     |
> > 	-> mem_hotplug_done()		     |
> > 					     | kmemleak_scan()
> > 					     | -> get_online_mems()
> > 					     |    ...
> > -> mhp_deinit_memmap_on_memory()	     |
> >    [not protected by mem_hotplug_begin/done()]|
> >    Marks memory section as offline,	     |   Retrieves zone_start_pfn
> >    poisons vmemmap struct pages and updates   |   and struct page members.
> >    the zone related data			     |
> >     					     |    ...
> >     					     | -> put_online_mems()
> > 
> > Fix this by ensuring mem_hotplug_lock is taken before performing
> > mhp_init_memmap_on_memory(). Also ensure that
> > mhp_deinit_memmap_on_memory() holds the lock.
> 
> What speaks against grabbing that lock in these functions?
>
At present, the functions online_pages() and offline_pages() acquire the
mem_hotplug_lock right at the start. However, given the necessity of
locking in mhp_(de)init_memmap_on_memory(), it would be more efficient
to consolidate the locking process by holding the mem_hotplug_lock once
in memory_block_online() and memory_block_offline().

Moreover, the introduction of the 'memmap on memory' feature on s390
brings a new physical memory notifier, and functions like __add_pages()
or arch_add_memory() are consistently invoked with the mem_hotplug_lock
already acquired.

Considering these factors, it seemed more natural to move taking the
mem_hotplug_lock into memory_block_online() and memory_block_offline(),
which was described as "fixing locking order" in the subject.
I will change the subject to "add missing locking" if that is misleading.
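
(Concretely, the intended shape would be roughly the following; a sketch
only, with error handling elided and the exact fields/locals assumed
from memory, not the final patch:)

	static int memory_block_online(struct memory_block *mem)
	{
		...
		mem_hotplug_begin();
		/* initialize the memmap residing on the hotplugged range */
		if (nr_vmemmap_pages)
			ret = mhp_init_memmap_on_memory(start_pfn,
							nr_vmemmap_pages, zone);
		if (!ret)
			ret = online_pages(start_pfn + nr_vmemmap_pages,
					   nr_pages - nr_vmemmap_pages,
					   zone, mem->group);
		mem_hotplug_done();
		return ret;
	}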

Would you or Oscar agree that there is a need for those
mhp_(de)init_memmap_on_memory() functions to take the lock at all?

Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/8] mm/memory_hotplug: fix error handling in add_memory_resource()
  2023-11-14 18:36   ` David Hildenbrand
@ 2023-11-15 13:45     ` Sumanth Korikkar
  0 siblings, 0 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-15 13:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML, Vishal Verma

On Tue, Nov 14, 2023 at 07:36:20PM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > In add_memory_resource(), creation of memory block devices occurs after
> > successful call to arch_add_memory(). However, creation of memory block
> > devices could fail.  In that case, arch_remove_memory() is called to
> > perform necessary cleanup.
> > 
> > Currently with or without altmap support, arch_remove_memory() is always
> > passed with altmap set to NULL during error handling. This leads to
> > freeing of struct pages using free_pages(), even though the allocation
> > might have been performed with altmap support via
> > altmap_alloc_block_buf().
> > 
> > Fix the error handling by passing altmap in arch_remove_memory(). This
> > ensures the following:
> > * When altmap is disabled, deallocation of the struct pages array occurs
> >    via free_pages().
> > * When altmap is enabled, deallocation occurs via vmem_altmap_free().
> > 
> > Fixes: db051a0dac13 ("mm/memory_hotplug: create memory block devices after arch_add_memory()")
> 
> That's the wrong commit. We didn't support memmap-on-memory back then.
> 
> Likely it should be:
> 
> Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added
> memory range")
>
Ok, I will change it accordingly.
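
(Concretely, something like the following in the add_memory_resource()
error path; a sketch, not the final patch:)

	/* create memory block devices after memory was added */
	ret = create_memory_block_devices(start, size, params.altmap, group);
	if (ret) {
		/* free the memmap the same way it was allocated */
		arch_remove_memory(start, size, params.altmap);
		goto error;
	}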

Thanks
...
> 
> Indeed; this will conflict with Vishal's patches, CCing him.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  2023-11-14 18:39   ` David Hildenbrand
@ 2023-11-15 14:20     ` Sumanth Korikkar
  2023-11-16 19:16       ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-15 14:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> > 
...
> >   arch/s390/mm/init.c          | 16 +++++++++++++++-
> >   drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
> >   2 files changed, 45 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> > index 8d9a60ccb777..db505ed590b2 100644
> > --- a/arch/s390/mm/init.c
> > +++ b/arch/s390/mm/init.c
> > @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >   	rc = vmem_add_mapping(start, size);
> >   	if (rc)
> >   		return rc;
> > +	/*
> > +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> > +	 * onlining phase
> > +	 */
> > +	if (params->altmap)
> > +		return 0;
> 
> 
> So we'd have added memory blocks without a memmap? Sorry, but this seems to
> further hack into the s390x direction.

This new approach has the advantage that we do not need to allocate any
additional memory during online phase, neither for direct mapping page
tables nor struct pages, so that memory hotplug can never fail.

The old approach (without altmap) is already a hack, because we add
the memmap / struct pages, but for memory that is not really accessible.
And with all the disadvantage of pre-allocating struct pages from system
memory.

The new approach allows to better integrate s390 to the existing
interface, and also make use of altmap support, which would eliminate
the major disadvantage of the old behaviour. So from s390 perspective,
this new mechanism would be preferred, provided that there is no
functional issue with the "added memory blocks without a memmap"
approach.

Do you see any functional issues, e.g. conflict with common
code?
> 
> Maybe s390x should just provide a dedicated interface to add these memory
> blocks instead of adding them during boot and then relying on the old way of
> using online/offline to set them online/offline.

Existing behavior:
The current 'lsmem -a' command displays both online and standby memory.

interface changes:
If a new interface is introduced and standby memory is no longer listed,
the following consequences might occur:

1. Running 'lsmem -a' would only show online memory, potentially leading
   to user complaints.
2. standby memory addition would need:
   * echo "standby memory addr" > /sys/devices/system/memory/probe
   As far as I understand, this interface is already deprecated.

3. To remove standby memory, a new interface probe_remove is needed
   * echo "standby memory addr" > /sys/devices/system/memory/probe_remove

4. Users may express a need to identify standby memory addresses,
resulting in the creation of another interface to list these standby
memory ranges.

Hence, introducing new physical memory notifiers to platforms lacking
dynamic ACPI events would be highly advantageous while maintaining the
existing user-friendly interface.
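
(For reference, the existing flow that should stay intact; memory16 is
just an arbitrary example block:)

	# list online and standby memory blocks
	lsmem -a
	# set a standby block online; with this series, this triggers
	# MEM_PHYS_ONLINE before the pages are onlined
	echo online > /sys/devices/system/memory/memory16/state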

Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  2023-11-14 18:27   ` David Hildenbrand
@ 2023-11-15 14:23     ` Sumanth Korikkar
  2023-11-16 19:03       ` David Hildenbrand
  2023-11-15 15:03     ` Gerald Schaefer
  1 sibling, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-15 14:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Tue, Nov 14, 2023 at 07:27:35PM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Add new memory notifiers to mimic the dynamic ACPI event triggered logic
> > for memory hotplug on platforms that do not generate such events. This
> > will be used to implement "memmap on memory" feature for s390 in a later
> > patch.
> > 
> > Platforms such as x86 can support physical memory hotplug via ACPI. When
> > there is physical memory hotplug, ACPI event leads to the memory
> > addition with the following callchain:
> > acpi_memory_device_add()
> >    -> acpi_memory_enable_device()
> >       -> __add_memory()
> > 
> > After this, the hotplugged memory is physically accessible, and altmap
> > support prepared, before the "memmap on memory" initialization in
> > memory_block_online() is called.
> > 
> > On s390, memory hotplug works in a different way. The available hotplug
> > memory has to be defined upfront in the hypervisor, but it is made
> > physically accessible only when the user sets it online via sysfs,
> > currently in the MEM_GOING_ONLINE notifier. This requires calling
> > add_memory() during early memory detection, in order to get the sysfs
> > representation, but we cannot use "memmap on memory" altmap support at
> > this stage, w/o having it physically accessible.
> > 
> > Since no ACPI or similar events are generated, there is no way to set up
> > altmap support, or even make the memory physically accessible at all,
> > before the "memmap on memory" initialization in memory_block_online().
> > 
> > The new MEM_PHYS_ONLINE notifier allows to work around this, by
> > providing a hook to make the memory physically accessible, and also call
> > __add_pages() with altmap support, early in memory_block_online().
> > Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
> > inaccessible and call __remove_pages(), at the end of
> > memory_block_offline().
> > 
> > Calling __add/remove_pages() requires mem_hotplug_lock, so move
> > mem_hotplug_begin/done() to include the new notifiers.
> > 
> > All architectures ignore unknown memory notifiers, so this patch should
> > not introduce any functional changes.
> 
> Sorry to say, no. No hacks please, and this is a hack for memory that has
> already been added to the system.
> 
> If you want memory without an altmap to suddenly not have an altmap anymore,
> then look into removing and readding that memory, or some way to convert
> offline memory.

Sorry, I couldn't get the context. Could you please give me more details?

Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  2023-11-14 18:27   ` David Hildenbrand
  2023-11-15 14:23     ` Sumanth Korikkar
@ 2023-11-15 15:03     ` Gerald Schaefer
  2023-11-16 19:02       ` David Hildenbrand
  1 sibling, 1 reply; 40+ messages in thread
From: Gerald Schaefer @ 2023-11-15 15:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sumanth Korikkar, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Tue, 14 Nov 2023 19:27:35 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Add new memory notifiers to mimic the dynamic ACPI event triggered logic
> > for memory hotplug on platforms that do not generate such events. This
> > will be used to implement "memmap on memory" feature for s390 in a later
> > patch.
> > 
> > Platforms such as x86 can support physical memory hotplug via ACPI. When
> > there is physical memory hotplug, ACPI event leads to the memory
> > addition with the following callchain:
> > acpi_memory_device_add()  
> >    -> acpi_memory_enable_device()
> >       -> __add_memory()  
> > 
> > After this, the hotplugged memory is physically accessible, and altmap
> > support prepared, before the "memmap on memory" initialization in
> > memory_block_online() is called.
> > 
> > On s390, memory hotplug works in a different way. The available hotplug
> > memory has to be defined upfront in the hypervisor, but it is made
> > physically accessible only when the user sets it online via sysfs,
> > currently in the MEM_GOING_ONLINE notifier. This requires calling
> > add_memory() during early memory detection, in order to get the sysfs
> > representation, but we cannot use "memmap on memory" altmap support at
> > this stage, w/o having it physically accessible.
> > 
> > Since no ACPI or similar events are generated, there is no way to set up
> > altmap support, or even make the memory physically accessible at all,
> > before the "memmap on memory" initialization in memory_block_online().
> > 
> > The new MEM_PHYS_ONLINE notifier allows to work around this, by
> > providing a hook to make the memory physically accessible, and also call
> > __add_pages() with altmap support, early in memory_block_online().
> > Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
> > inaccessible and call __remove_pages(), at the end of
> > memory_block_offline().
> > 
> > Calling __add/remove_pages() requires mem_hotplug_lock, so move
> > mem_hotplug_begin/done() to include the new notifiers.
> > 
> > All architectures ignore unknown memory notifiers, so this patch should
> > not introduce any functional changes.  
> 
> Sorry to say, no. No hacks please, and this is a hack for memory that 
> has already been added to the system.

IIUC, when we enter memory_block_online(), memory has always already
been added to the system, on all architectures. E.g. via ACPI events
on x86, or with the existing s390 hack, where we add it at boot time,
including memmap allocated from system memory. Without a preceding
add_memory() you cannot reach memory_block_online() via sysfs online.

The difference is that for s390, the memory is not yet physically
accessible, and therefore we cannot use the existing altmap support
in memory_block_online(), which requires that the memory is accessible
before it calls mhp_init_memmap_on_memory().

Currently, on s390 we make the memory accessible in the GOING_ONLINE
notifier, by sclp call to the hypervisor. That is too late for altmap
setup code in memory_block_online(), therefore we'd like to introduce
the new notifier, to have a hook where we can make it accessible
earlier, and after that there is no difference to how it works for
other architectures, and we can make use of the existing altmap support.

> 
> If you want memory without an altmap to suddenly not have an altmap 
> anymore, then look into removing and readding that memory, or some way 
> to convert offline memory.

We do not want to have memory suddenly not have altmap support
any more, but simply get a hook so that we can prepare the memory
to have altmap support. This means making it physically accessible,
and calling __add_pages() for altmap support, which for other
architectures has already happened before.

Of course, it is a hack for s390, that we must skip __add_pages()
in the initial (arch_)add_memory() during boot time, when we want
altmap support, because the memory simply is not accessible at that
time. But s390 memory hotplug support has always been a hack, and
had to be, because of how it is implemented by the architecture.

So we replace one hack with another one, that has the huge advantage
that we do not need to allocate struct pages upfront from system
memory any more, for the whole possible online memory range.

And the current approach comes without any change to existing
interfaces, and minimal change to common code, i.e. these new
notifiers, that should not have any impact on other architectures.

What exactly is your concern regarding the new notifiers? Is it
useless no-op notifier calls on other archs (not sure if they
would get optimized out by the compiler)?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order
  2023-11-15 13:41     ` Sumanth Korikkar
@ 2023-11-16 18:40       ` David Hildenbrand
  2023-11-17 13:42         ` Sumanth Korikkar
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 18:40 UTC (permalink / raw)
  To: Sumanth Korikkar
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 15.11.23 14:41, Sumanth Korikkar wrote:
> On Tue, Nov 14, 2023 at 07:22:33PM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>
>> The patch subject talks about "fixing locking order", but it's actually
>> missing locking, no?
>>
>>>   From Documentation/core-api/memory-hotplug.rst:
>>> When adding/removing/onlining/offlining memory or adding/removing
>>> heterogeneous/device memory, we should always hold the mem_hotplug_lock
>>> in write mode to serialise memory hotplug (e.g. access to global/zone
>>> variables).
>>>
>>> mhp_(de)init_memmap_on_memory() functions can change zone stats and
>>> struct page content, but they are currently called w/o the
>>> mem_hotplug_lock.
>>>
>>> When memory block is being offlined and when kmemleak goes through each
>>> populated zone, the following theoretical race conditions could occur:
>>> CPU 0:					     | CPU 1:
>>> memory_offline()			     |
>>> -> offline_pages()			     |
>>> 	-> mem_hotplug_begin()		     |
>>> 	   ...				     |
>>> 	-> mem_hotplug_done()		     |
>>> 					     | kmemleak_scan()
>>> 					     | -> get_online_mems()
>>> 					     |    ...
>>> -> mhp_deinit_memmap_on_memory()	     |
>>>     [not protected by mem_hotplug_begin/done()]|
>>>     Marks memory section as offline,	     |   Retrieves zone_start_pfn
>>>     poisons vmemmap struct pages and updates   |   and struct page members.
>>>     the zone related data			     |
>>>      					     |    ...
>>>      					     | -> put_online_mems()
>>>
>>> Fix this by ensuring mem_hotplug_lock is taken before performing
>>> mhp_init_memmap_on_memory(). Also ensure that
>>> mhp_deinit_memmap_on_memory() holds the lock.
>>
>> What speaks against grabbing that lock in these functions?
>>
> At present, the functions online_pages() and offline_pages() acquire the
> mem_hotplug_lock right at the start. However, given the necessity of
> locking in mhp_(de)init_memmap_on_memory(), it would be more efficient
> to consolidate the locking process by holding the mem_hotplug_lock once
> in memory_block_online() and memory_block_offline().

Good point; can you similarly add comments to these two functions that 
they need that lock in write mode?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
  2023-11-14 18:02 ` [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE Sumanth Korikkar
@ 2023-11-16 18:43   ` David Hildenbrand
  2023-11-17 21:39   ` kernel test robot
  1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 18:43 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

On 14.11.23 19:02, Sumanth Korikkar wrote:
> vmem_altmap_free() and vmem_altmap_offset() could be utilized without
> CONFIG_ZONE_DEVICE enabled. Hence, move them to sparse-vmemmap.c

Maybe give an example: mm/memory_hotplug.c:__add_pages(), for instance, 
relies on that.

The altmap is no longer restricted to ZONE_DEVICE handling.
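
For reference, the two helpers being moved are tiny (quoted from
memory, so details may differ slightly):

	unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
	{
		/* number of pfns from base where pfn_to_page() is valid */
		return altmap->reserve + altmap->free;
	}

	void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
	{
		altmap->alloc -= nr_pfns;
	}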

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  2023-11-15 15:03     ` Gerald Schaefer
@ 2023-11-16 19:02       ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 19:02 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Sumanth Korikkar, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 15.11.23 16:03, Gerald Schaefer wrote:
> On Tue, 14 Nov 2023 19:27:35 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Add new memory notifiers to mimic the dynamic ACPI event triggered logic
>>> for memory hotplug on platforms that do not generate such events. This
>>> will be used to implement "memmap on memory" feature for s390 in a later
>>> patch.
>>>
>>> Platforms such as x86 can support physical memory hotplug via ACPI. When
>>> there is physical memory hotplug, ACPI event leads to the memory
>>> addition with the following callchain:
>>> acpi_memory_device_add()
>>>     -> acpi_memory_enable_device()
>>>        -> __add_memory()
>>>
>>> After this, the hotplugged memory is physically accessible, and altmap
>>> support prepared, before the "memmap on memory" initialization in
>>> memory_block_online() is called.
>>>
>>> On s390, memory hotplug works in a different way. The available hotplug
>>> memory has to be defined upfront in the hypervisor, but it is made
>>> physically accessible only when the user sets it online via sysfs,
>>> currently in the MEM_GOING_ONLINE notifier. This requires calling
>>> add_memory() during early memory detection, in order to get the sysfs
>>> representation, but we cannot use "memmap on memory" altmap support at
>>> this stage, w/o having it physically accessible.
>>>
>>> Since no ACPI or similar events are generated, there is no way to set up
>>> altmap support, or even make the memory physically accessible at all,
>>> before the "memmap on memory" initialization in memory_block_online().
>>>
>>> The new MEM_PHYS_ONLINE notifier allows to work around this, by
>>> providing a hook to make the memory physically accessible, and also call
>>> __add_pages() with altmap support, early in memory_block_online().
>>> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
>>> inaccessible and call __remove_pages(), at the end of
>>> memory_block_offline().
>>>
>>> Calling __add/remove_pages() requires mem_hotplug_lock, so move
>>> mem_hotplug_begin/done() to include the new notifiers.
>>>
>>> All architectures ignore unknown memory notifiers, so this patch should
>>> not introduce any functional changes.
>>
>> Sorry to say, no. No hacks please, and this is a hack for memory that
>> has already been added to the system.
> 
> IIUC, when we enter memory_block_online(), memory has always already
> been added to the system, on all architectures. E.g. via ACPI events
> on x86, or with the existing s390 hack, where we add it at boot time,
> including memmap allocated from system memory. Without a preceding
> add_memory() you cannot reach memory_block_online() via sysfs online.

Adding that memory block at boot time is the legacy leftover s390x is 
carrying along; and now we want to "work around" that by adding s390x 
special handling to the online/offlining code and having memory blocks 
without any memmap, or configuring an altmap at the very last minute 
using an s390x-specific memory notifier.

Instead, if you want to support the altmap, the kernel should not add 
standby memory to the system (if configured for this new feature), but 
instead only remember the standby memory ranges so it knows what can 
later be added and what can't.

From there, users should have an interface where they can actually add 
memory to the system, and either online it manually or just let the 
kernel online it automatically.

s390x code will call add_memory() and properly prepare an altmap if 
requested and make that standby memory available. You can then even have 
an interface to remove that memory again once offline. That will work 
with an altmap or without an altmap.
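
(Roughly like the following; only add_memory() and the MHP flag are
real, the s390x helper and variable names are made up for illustration:)

	/* hypothetical handler for an "add this standby range" request */
	mhp_t mhp_flags = MHP_NONE;
	int rc;

	if (memmap_on_memory_wanted)	/* assumed policy knob */
		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
	/* make the standby range physically accessible */
	rc = sclp_mem_change_state(start, size, 1);
	if (!rc)
		rc = add_memory(nid, start, size, mhp_flags);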

This approach is aligned with any other code that hot(un)plugs memory 
and is compatible with things like the variable-sized memory blocks 
people have been talking about for quite a while already, and with 
altmaps that span multiple memory blocks to make gigantic pages in such 
ranges usable.

Sure, you'll have a new interface and have to enable the new handling 
for the new kernel, but you're asking to support a new feature that 
cannot be supported cleanly the way any other architecture does it. But 
it's a clean approach and probably should have been done that way right 
from the start (decades ago).

Note: We do have the same for other architectures without ACPI that add 
memory via the probe interface. But IIRC we cannot really do any checks 
there, because these architectures have no way of identifying what 
memory can actually be added.

> 
> The difference is that for s390, the memory is not yet physically
> accessible, and therefore we cannot use the existing altmap support
> in memory_block_online(), which requires that the memory is accessible
> before it calls mhp_init_memmap_on_memory().
> 
> Currently, on s390 we make the memory accessible in the GOING_ONLINE
> notifier, by sclp call to the hypervisor. That is too late for altmap
> setup code in memory_block_online(), therefore we'd like to introduce
> the new notifier, to have a hook where we can make it accessible
> earlier, and after that there is no difference to how it works for
> other architectures, and we can make use of the existing altmap support.
> 
>>
>> If you want memory without an altmap to suddenly not have an altmap
>> anymore, then look into removing and readding that memory, or some way
>> to convert offline memory.
> 
> We do not want to have memory suddenly not have altmap support
> any more, but simply get a hook so that we can prepare the memory
> to have altmap support. This means making it physically accessible,
> and calling __add_pages() for altmap support, which for other
> architectures has already happened before.
> 
> Of course, it is a hack for s390, that we must skip __add_pages()
> in the initial (arch_)add_memory() during boot time, when we want
> altmap support, because the memory simply is not accessible at that
> time. But s390 memory hotplug support has always been a hack, and
> had to be, because of how it is implemented by the architecture.

I wrote the above paragraph before reading this; and it's fully aligned 
with what I said above.

> 
> So we replace one hack with another one, that has the huge advantage
> that we do not need to allocate struct pages upfront from system
> memory any more, for the whole possible online memory range.
> 
> And the current approach comes without any change to existing
> interfaces, and minimal change to common code, i.e. these new
> notifiers, that should not have any impact on other architectures.
> 
> What exactly is your concern regarding the new notifiers? Is it
> useless no-op notifier calls on other archs (not sure if they
> would get optimized out by the compiler)?

That it makes hotplug code more special because of s390x, instead of 
cleaning up that legacy code.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  2023-11-15 14:23     ` Sumanth Korikkar
@ 2023-11-16 19:03       ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 19:03 UTC (permalink / raw)
  To: Sumanth Korikkar
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 15.11.23 15:23, Sumanth Korikkar wrote:
> On Tue, Nov 14, 2023 at 07:27:35PM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Add new memory notifiers to mimic the dynamic ACPI event triggered logic
>>> for memory hotplug on platforms that do not generate such events. This
>>> will be used to implement "memmap on memory" feature for s390 in a later
>>> patch.
>>>
>>> Platforms such as x86 can support physical memory hotplug via ACPI. When
>>> there is physical memory hotplug, ACPI event leads to the memory
>>> addition with the following callchain:
>>> acpi_memory_device_add()
>>>     -> acpi_memory_enable_device()
>>>        -> __add_memory()
>>>
>>> After this, the hotplugged memory is physically accessible, and altmap
>>> support prepared, before the "memmap on memory" initialization in
>>> memory_block_online() is called.
>>>
>>> On s390, memory hotplug works in a different way. The available hotplug
>>> memory has to be defined upfront in the hypervisor, but it is made
>>> physically accessible only when the user sets it online via sysfs,
>>> currently in the MEM_GOING_ONLINE notifier. This requires calling
>>> add_memory() during early memory detection, in order to get the sysfs
>>> representation, but we cannot use "memmap on memory" altmap support at
>>> this stage, w/o having it physically accessible.
>>>
>>> Since no ACPI or similar events are generated, there is no way to set up
>>> altmap support, or even make the memory physically accessible at all,
>>> before the "memmap on memory" initialization in memory_block_online().
>>>
>>> The new MEM_PHYS_ONLINE notifier allows to work around this, by
>>> providing a hook to make the memory physically accessible, and also call
>>> __add_pages() with altmap support, early in memory_block_online().
>>> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
>>> inaccessible and call __remove_pages(), at the end of
>>> memory_block_offline().
>>>
>>> Calling __add/remove_pages() requires mem_hotplug_lock, so move
>>> mem_hotplug_begin/done() to include the new notifiers.
>>>
>>> All architectures ignore unknown memory notifiers, so this patch should
>>> not introduce any functional changes.
>>
>> Sorry to say, no. No hacks please, and this is a hack for memory that has
>> already been added to the system.
>>
>> If you want memory without an altmap to suddenly not have an altmap anymore,
>> then look into removing and readding that memory, or some way to convert
>> offline memory.
> 
> Sorry, I couldn't get the context. Could you please give me more details?

See my reply to Gerald.

In an ideal world, there would not be any new callbacks; we would get 
rid of them and just let the architecture properly hotplug memory to 
the system when requested by the user.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  2023-11-15 14:20     ` Sumanth Korikkar
@ 2023-11-16 19:16       ` David Hildenbrand
  2023-11-16 19:23         ` David Hildenbrand
  2023-11-17 13:00         ` Gerald Schaefer
  0 siblings, 2 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 19:16 UTC (permalink / raw)
  To: Sumanth Korikkar
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 15.11.23 15:20, Sumanth Korikkar wrote:
> On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
>>>
> ...
>>>    arch/s390/mm/init.c          | 16 +++++++++++++++-
>>>    drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
>>>    2 files changed, 45 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>>> index 8d9a60ccb777..db505ed590b2 100644
>>> --- a/arch/s390/mm/init.c
>>> +++ b/arch/s390/mm/init.c
>>> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>    	rc = vmem_add_mapping(start, size);
>>>    	if (rc)
>>>    		return rc;
>>> +	/*
>>> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
>>> +	 * onlining phase
>>> +	 */
>>> +	if (params->altmap)
>>> +		return 0;
>>
>>
>> So we'd have added memory blocks without a memmap? Sorry, but this seems to
>> further hack into the s390x direction.
> 
> This new approach has the advantage that we do not need to allocate any
> additional memory during online phase, neither for direct mapping page
> tables nor struct pages, so that memory hotplug can never fail.

Right, just like any other architecture that (triggered by whatever 
mechanism) ends up calling add_memory() and friends.

> 
> The old approach (without altmap) is already a hack, because we add
> the memmap / struct pages, but for memory that is not really accessible.

Yes, it's disgusting. And you still allocate other things like memory 
block devices or the identity map.

> And with all the disadvantage of pre-allocating struct pages from system
> memory.

Jep. It never should have been done like that.

> 
> The new approach allows to better integrate s390 to the existing
> interface, and also make use of altmap support, which would eliminate
> the major disadvantage of the old behaviour. So from s390 perspective,
> this new mechanism would be preferred, provided that there is no
> functional issue with the "added memory blocks without a memmap"
> approach.

It achieves that by s390x-specific hacks in common code :) instead of 
simply using add_memory() and friends like everybody else.

> 
> Do you see any functional issues, e.g. conflict with common
> code?

I don't see functional issues right now, just the way it is done to 
implement support for a new feature is a hack IMHO. Replacing hack #1 by 
hack #2 is not really something reasonable. Let's try to remove hacks.

>>
>> Maybe s390x should just provide a dedicated interface to add these memory
>> blocks instead of adding them during boot and then relying on the old way of
>> using online/offline to set them online/offline.
> 
> Existing behavior:
> The current 'lsmem -a' command displays both online and standby memory.
> 
> interface changes:
> If a new interface is introduced and standby memory is no longer listed,
> the following consequences might occur:
> 
> 1. Running 'lsmem -a' would only show online memory, potentially leading
>     to user complaints.

That's why the new, clean way of doing it will require a world switch. 
If the admin wants the benefits of altmap/memmap allocation, it can be 
enabled.

> 2. standby memory addition would need:
>     * echo "standby memory addr" > /sys/devices/system/memory/probe
>     As far as I understand, this interface is already deprecated.

It should actually be an s390x-specific interface where people are able 
to query the standby ranges, and request to add/remove them. There, 
s390x can perform checks and set up everything accordingly before 
calling add_memory() and having the memory onlined.

We do have something comparable with the dax/kmem infrastructure: users 
configure the available memory to hotplug, and then hotplug it. Tooling 
onlines that memory automatically.
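
(For comparison, the dax/kmem flow, e.g. with daxctl; the device name
is just an example:)

	# hand a DAX device to the kernel as system RAM; udev/daxctl
	# policy then onlines the resulting memory blocks
	daxctl reconfigure-device dax0.0 --mode=system-ram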

Ideally they will add ranges, not memory blocks.

> 
> 3. To remove standby memory, a new interface probe_remove is needed
>     * echo "standby memory addr" > /sys/devices/system/memory/probe_remove
> 

Similarly, an s390x-specific interface that performs checks and properly 
tears everything s390x-specific down -- for example, turning system RAM 
into standby RAM again.


> 4. Users may express a need to identify standby memory addresses,
> resulting in the creation of another interface to list these standby
> memory ranges.

Exactly. Memory that is not added to the system does not consume 
any resources, but can be added on demand using an interface that is not 
the second stage (onlining/offlining) of memory hot(un)plug.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  2023-11-16 19:16       ` David Hildenbrand
@ 2023-11-16 19:23         ` David Hildenbrand
  2023-11-17 13:00         ` Gerald Schaefer
  1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 19:23 UTC (permalink / raw)
  To: Sumanth Korikkar
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

>>>
>>> Maybe s390x should just provide a dedicated interface to add these memory
>>> blocks instead of adding them during boot and then relying on the old way of
>>> using online/offline to set them online/offline.
>>
>> Existing behavior:
>> The current 'lsmem -a' command displays both online and standby memory.
>>
>> interface changes:
>> If a new interface is introduced and standby memory is no longer listed,
>> the following consequences might occur:
>>
>> 1. Running 'lsmem -a' would only show online memory, potentially leading
>>      to user complaints.
> 
> That's why the new, clean way of doing it will require a world switch.
> If the admin wants the benefits of altmap/memmap allocation, it can be
> enabled.

BTW, thinking about it, I guess one could teach lsmem (and maybe chmem) 
to consult additional interfaces on s390x to show standby memory that's 
not added to the system yet.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/8] s390/sclp: remove unhandled memory notifier type
  2023-11-14 18:02 ` [PATCH 7/8] s390/sclp: remove unhandled memory notifier type Sumanth Korikkar
@ 2023-11-16 19:33   ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 19:33 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

On 14.11.23 19:02, Sumanth Korikkar wrote:
> Remove memory notifier types which are unhandled by s390.  Unhandled
> memory notifier types are covered by the default case.
> 
> Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
> ---
>   drivers/s390/char/sclp_cmd.c | 8 --------
>   1 file changed, 8 deletions(-)
> 
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index 12f3d4af7e4e..428f8a583e8f 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -349,11 +349,6 @@ static int sclp_mem_notifier(struct notifier_block *nb,
>   		if (contains_standby_increment(start, start + size))
>   			rc = -EPERM;
>   		break;
> -	case MEM_ONLINE:
> -	case MEM_CANCEL_OFFLINE:
> -		break;
> -	case MEM_GOING_ONLINE:
> -		break;
>   	case MEM_PHYS_ONLINE:
>   		rc = sclp_mem_change_state(start, size, 1);
>   		if (rc || !memory_block->altmap)
> @@ -369,9 +364,6 @@ static int sclp_mem_notifier(struct notifier_block *nb,
>   		 */
>   		__arch_set_page_nodat((void *)start, memory_block->altmap->free);
>   		break;
> -	case MEM_CANCEL_ONLINE:
> -	case MEM_OFFLINE:
> -		break;
>   	case MEM_PHYS_OFFLINE:
>   		if (memory_block->altmap)
>   			__remove_pages(arg->start_pfn, arg->nr_pages, memory_block->altmap);


You can remove MEM_ONLINE + MEM_CANCEL_OFFLINE ahead of time and just 
clean up the others in the patch where you touch these cases.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
                   ` (7 preceding siblings ...)
  2023-11-14 18:02 ` [PATCH 8/8] s390: enable MHP_MEMMAP_ON_MEMORY Sumanth Korikkar
@ 2023-11-16 23:08 ` David Hildenbrand
  2023-11-17 13:00   ` Gerald Schaefer
  2023-11-17 13:56   ` Sumanth Korikkar
  8 siblings, 2 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-16 23:08 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton
  Cc: Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

On 14.11.23 19:02, Sumanth Korikkar wrote:
> Hi All,
> 
> The patch series implements "memmap on memory" feature on s390 and
> provides the necessary fixes for it.

Thinking about this, one thing that makes s390x different from all the 
other architectures in this series is the altmap handling.

I'm curious, why is that even required?

A memmap that is not marked as online in the section should not be 
touched by anybody (except memory onlining code :) ). And if we do, it's 
usually a BUG because that memmap might contain garbage/be poisoned or 
completely stale, so we might want to track that down and fix it in any 
case.

So what speaks against just leaving add_memory() populate the memmap 
from the altmap? Then, also the page tables for the memmap are already 
in place when onlining memory.


Then, adding two new notifier calls at the start of memory_block_online() 
called something like MEM_PREPARE_ONLINE and at the end of 
memory_block_offline() called something like MEM_FINISH_OFFLINE is still 
suboptimal, but that's where standby memory could be 
activated/deactivated, without messing with the altmap.

That way, the only s390x specific thing is that the memmap that should 
not be touched by anybody is actually inaccessible, and you'd 
activate/deactivate simply from the new notifier calls just the way we 
used to do.
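
(Sketch of the placement; MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE are the
proposed names and do not exist yet:)

	static int memory_block_online(struct memory_block *mem)
	{
		...
		/* let s390x make the range physically accessible first */
		ret = memory_notify(MEM_PREPARE_ONLINE, &arg);
		ret = notifier_to_errno(ret);
		if (ret)
			return ret;
		...
	}

	static int memory_block_offline(struct memory_block *mem)
	{
		...
		/* last chance to make the range inaccessible again */
		memory_notify(MEM_FINISH_OFFLINE, &arg);
		return 0;
	}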

It's still all worse than just adding/removing memory properly, using a 
proper interface -- where you could alloc/free an actual memmap when the 
altmap is not desired. But I know that people don't want to spend time 
just doing it cleanly from scratch.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-16 23:08 ` [PATCH 0/8] implement "memmap on memory" feature on s390 David Hildenbrand
@ 2023-11-17 13:00   ` Gerald Schaefer
  2023-11-17 15:37     ` David Hildenbrand
  2023-11-17 13:56   ` Sumanth Korikkar
  1 sibling, 1 reply; 40+ messages in thread
From: Gerald Schaefer @ 2023-11-17 13:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sumanth Korikkar, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Fri, 17 Nov 2023 00:08:31 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Hi All,
> > 
> > The patch series implements "memmap on memory" feature on s390 and
> > provides the necessary fixes for it.  
> 
> Thinking about this, one thing that makes s390x different from all the 
> other architectures in this series is the altmap handling.
> 
> I'm curious, why is that even required?
> 
> A memmap that is not marked as online in the section should not be 
> touched by anybody (except memory onlining code :) ). And if we do, it's 
> usually a BUG because that memmap might contain garbage/be poisoned or 
> completely stale, so we might want to track that down and fix it in any 
> case.
> 
> So what speaks against just leaving add_memory() populate the memmap 
> from the altmap? Then, also the page tables for the memmap are already 
> in place when onlining memory.

Good question, I am not 100% sure if we ran into bugs, or simply assumed
that it is not OK to call __add_pages() when the memory for the altmap
is not accessible.

Maybe there is also already a common code bug with that; s390 might be
special, but that is often also good for finding bugs in common code ...

> Then, adding two new notifier calls at the start of memory_block_online() 
> called something like MEM_PREPARE_ONLINE and at the end of 
> memory_block_offline() called something like MEM_FINISH_OFFLINE is still 
> suboptimal, but that's where standby memory could be 
> activated/deactivated, without messing with the altmap.
> 
> That way, the only s390x specific thing is that the memmap that should 
> not be touched by anybody is actually inaccessible, and you'd 
> activate/deactivate simply from the new notifier calls just the way we 
> used to do.
> 
> It's still all worse than just adding/removing memory properly, using a 
> proper interface -- where you could alloc/free an actual memmap when the 
> altmap is not desired. But I know that people don't want to spend time 
> just doing it cleanly from scratch.

Yes, sometimes they need to be forced to do that :-)

So, we'll look into defining a "proper interface", and treat patches 1-3
separately as bug fixes? Especially patch 3 might be interesting for arm:
if they do not have ZONE_DEVICE but still use the functions, they might
end up with the no-op version, not really freeing any memory.
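
(The no-op stubs in question, as in include/linux/memremap.h with
CONFIG_ZONE_DEVICE disabled; quoted from memory, so the exact form may
differ:)

	static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
	{
		return 0;
	}

	static inline void vmem_altmap_free(struct vmem_altmap *altmap,
					    unsigned long nr_pfns)
	{
	}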

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  2023-11-16 19:16       ` David Hildenbrand
  2023-11-16 19:23         ` David Hildenbrand
@ 2023-11-17 13:00         ` Gerald Schaefer
  2023-11-20 14:49           ` David Hildenbrand
  1 sibling, 1 reply; 40+ messages in thread
From: Gerald Schaefer @ 2023-11-17 13:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sumanth Korikkar, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Thu, 16 Nov 2023 20:16:02 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 15.11.23 15:20, Sumanth Korikkar wrote:
> > On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:  
> >> On 14.11.23 19:02, Sumanth Korikkar wrote:  
> >>> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> >>>  
> > ...  
> >>>    arch/s390/mm/init.c          | 16 +++++++++++++++-
> >>>    drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
> >>>    2 files changed, 45 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> >>> index 8d9a60ccb777..db505ed590b2 100644
> >>> --- a/arch/s390/mm/init.c
> >>> +++ b/arch/s390/mm/init.c
> >>> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >>>    	rc = vmem_add_mapping(start, size);
> >>>    	if (rc)
> >>>    		return rc;
> >>> +	/*
> >>> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> >>> +	 * onlining phase
> >>> +	 */
> >>> +	if (params->altmap)
> >>> +		return 0;  
> >>
> >>
> >> So we'd have added memory blocks without a memmap? Sorry, but this seems to
> >> further hack into the s390x direction.  
> > 
> > This new approach has the advantage that we do not need to allocate any
> > additional memory during online phase, neither for direct mapping page
> > tables nor struct pages, so that memory hotplug can never fail.  
> 
> Right, just like any other architecture that (triggered by whatever 
> mechanism) ends up calling add_memory() and friends.

Just for better understanding: are page tables for the identity and
vmemmap mappings not allocated from system memory on other archs? I.e.,
no altmap support for those, only for struct pages (so far)?

> 
> > 
> > The old approach (without altmap) is already a hack, because we add
> > the memmap / struct pages, but for memory that is not really accessible.  
> 
> Yes, it's disgusting. And you still allocate other things like memory 
> block devices or the identity map.

I would say it is special :-). And again, for understanding: everything
apart from struct pages would still need to be allocated from system
memory on other archs?

Of course, struct pages would be by far the biggest part, so having
altmap support for that helps a lot. Doing the other allocations also
via altmap would feel natural, but it is not possible yet, or did we
miss something?

> 
> > And with all the disadvantage of pre-allocating struct pages from system
> > memory.  
> 
> Jep. It never should have been done like that.

At that time, it gave us the benefit over all other architectures that
struct pages do not need to be allocated from system memory at memory
online time, when memory pressure might be high and such allocations
might fail.

I guess you can say that it should have been done "right" at that time,
e.g. by already adding something like altmap support, instead of our own
hack.

> 
> > 
> > The new approach allows to better integrate s390 to the existing
> > interface, and also make use of altmap support, which would eliminate
> > the major disadvantage of the old behaviour. So from s390 perspective,
> > this new mechanism would be preferred, provided that there is no
> > functional issue with the "added memory blocks without a memmap"
> > approach.  
> 
> It achieves that by s390x-specific hacks in common code :) instead of 
> simply using add_memory() and friends like everybody else.
> 
> > 
> > Do you see any functional issues, e.g. conflict with common
> > code?  
> 
> I don't see functional issues right now, just the way it is done to 
> implement support for a new feature is a hack IMHO. Replacing hack #1 by 
> hack #2 is not really something reasonable. Let's try to remove hacks.

Ok, sounds reasonable, let's try that. Introducing some new s390-specific
interface also feels a bit hacky, or ugly, but we'll see if we find a
way so that it is only "special" :-)
Reminds me a bit of the "probe" attribute, which also was an arch-specific
hack initially, IIRC, and is now to be deprecated...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order
  2023-11-16 18:40       ` David Hildenbrand
@ 2023-11-17 13:42         ` Sumanth Korikkar
  0 siblings, 0 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-17 13:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Thu, Nov 16, 2023 at 07:40:34PM +0100, David Hildenbrand wrote:
> Good point; can you similarly add comments to these two functions that they
> need that lock in write mode?

Ok, will add it.

Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-16 23:08 ` [PATCH 0/8] implement "memmap on memory" feature on s390 David Hildenbrand
  2023-11-17 13:00   ` Gerald Schaefer
@ 2023-11-17 13:56   ` Sumanth Korikkar
  2023-11-17 15:37     ` David Hildenbrand
  1 sibling, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-17 13:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Hi All,
> > 
> > The patch series implements "memmap on memory" feature on s390 and
> > provides the necessary fixes for it.
> 
> Thinking about this, one thing that makes s390x different from all the other
> architectures in this series is the altmap handling.
> 
> I'm curious, why is that even required?
> 
> A memmap that is not marked as online in the section should not be touched
> by anybody (except memory onlining code :) ). And if we do, it's usually a
> BUG because that memmap might contain garbage/be poisoned or completely
> stale, so we might want to track that down and fix it in any case.
> 
> So what speaks against just leaving add_memory() populate the memmap from
> the altmap? Then, also the page tables for the memmap are already in place
> when onlining memory.
>

We do have page_init_poison() in sparse_add_section(), which would then
have to be handled later, not in __add_pages().
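
(For reference, page_init_poison() writes a poison pattern over the
struct pages, roughly as in mm/debug.c and quoted from memory, which
would fault on s390 if the backing memory is not yet accessible:)

	void page_init_poison(struct page *page, size_t size)
	{
		if (page_init_poisoning)
			memset(page, PAGE_POISON_PATTERN, size);
	}
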
> 
> Then, adding two new notifier calls at the start of memory_block_online() called
> something like MEM_PREPARE_ONLINE and at the end of memory_block_offline()
> called something like MEM_FINISH_OFFLINE is still suboptimal, but that's
> where standby memory could be activated/deactivated, without messing with
> the altmap.
> 
> That way, the only s390x specific thing is that the memmap that should not
> be touched by anybody is actually inaccessible, and you'd
> activate/deactivate simply from the new notifier calls just the way we used
> to do.
Ok.

Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-17 13:00   ` Gerald Schaefer
@ 2023-11-17 15:37     ` David Hildenbrand
  2023-11-17 19:46       ` Sumanth Korikkar
  2023-11-21 13:13       ` Sumanth Korikkar
  0 siblings, 2 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-17 15:37 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Sumanth Korikkar, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 17.11.23 14:00, Gerald Schaefer wrote:
> On Fri, 17 Nov 2023 00:08:31 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Hi All,
>>>
>>> The patch series implements "memmap on memory" feature on s390 and
>>> provides the necessary fixes for it.
>>
>> Thinking about this, one thing that makes s390x different from all the
>> other architectures in this series is the altmap handling.
>>
>> I'm curious, why is that even required?
>>
>> A memmap that is not marked as online in the section should not be
>> touched by anybody (except memory onlining code :) ). And if we do, it's
>> usually a BUG because that memmap might contain garbage/be poisoned or
>> completely stale, so we might want to track that down and fix it in any
>> case.
>>
>> So what speaks against just leaving add_memory() populate the memmap
>> from the altmap? Then, also the page tables for the memmap are already
>> in place when onlining memory.
> 
> Good question, I am not 100% sure if we ran into bugs, or simply assumed
> that it is not OK to call __add_pages() when the memory for the altmap
> is not accessible.

I mean, we create the direct map even though nobody should access that 
memory, so maybe we can simply map the altmap even though nobody should 
access that memory.

As I said, then even the page tables for the altmap are already 
allocated, and memory onlining likely doesn't need any allocation 
anymore (except when there is kasan, or some other memory notifier has 
special demands).

Certainly simpler :)

> 
> Maybe there is also already a common code bug with that; s390 might be
> special, but that is often also good for finding bugs in common code ...

If it's only the page_init_poison() as noted by Sumanth, we could 
disable that on s390x with an altmap one way or the other; should be 
possible.

I mean, you effectively have your own poisoning if the altmap is 
effectively inaccessible and makes your CPU angry on access :)

Last but not least, support for an inaccessible altmap might come in 
handy for virtio-mem eventually, and make altmap support eventually 
simpler. So added bonus points.

> 
>> Then, adding two new notifier calls at the start of memory_block_online()
>> called something like MEM_PREPARE_ONLINE and at the end of
>> memory_block_offline() called something like MEM_FINISH_OFFLINE is still
>> suboptimal, but that's where standby memory could be
>> activated/deactivated, without messing with the altmap.
>>
>> That way, the only s390x specific thing is that the memmap that should
>> not be touched by anybody is actually inaccessible, and you'd
>> activate/deactivate simply from the new notifier calls just the way we
>> used to do.
>>
>> It's still all worse than just adding/removing memory properly, using a
>> proper interface -- where you could alloc/free an actual memmap when the
>> altmap is not desired. But I know that people don't want to spend time
>> just doing it cleanly from scratch.
> 
> Yes, sometimes they need to be forced to do that :-)

I certainly won't force you if we can just keep the __add_pages() calls 
as is; having an altmap that is inaccessible but fully prepared sounds 
reasonable to me.

I can see how this gives an immediate benefit to existing s390x 
installations without being too hacky and without taking a long time to 
settle.

But I'll strongly suggest evaluating a new interface long-term.

> 
> So, we'll look into defining a "proper interface", and treat patches 1-3
> separately as bug fixes? Especially patch 3 might be interesting for arm,
> if they do not have ZONE_DEVICE, but still use the functions, they might
> end up with the no-op version, not really freeing any memory.

It might make sense to

1) Send the first 3 out separately
2) Look into a simple variant that leaves __add_pages() calls alone and
    only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers --
    well, and deals with an inaccessible altmap, like the
    page_init_poison() when the altmap might be inaccessible.
3) Look into a proper interface to add/remove memory instead of relying
    on online/offline.

2) is certainly an improvement and might be desired in some cases. 3) is 
more powerful (e.g., where you don't want an altmap because of 
fragmentation) and future proof.
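
To make 2) a bit more concrete, a rough sketch of where such a notifier
could hook in (notifier names taken from the proposal above, everything
else simplified; this is a sketch of the idea, not the final interface):

static int memory_block_online(struct memory_block *mem)
{
	struct memory_notify arg = {
		.start_pfn = section_nr_to_pfn(mem->start_section_nr),
		.nr_pages = PAGES_PER_SECTION * sections_per_block,
	};
	int ret;

	/* let the arch make the (standby) range accessible first */
	ret = notifier_to_errno(memory_notify(MEM_PREPARE_ONLINE, &arg));
	if (ret)
		return ret;

	/* ... existing onlining path, including memmap-on-memory init ... */
	return 0;
}

with the MEM_FINISH_OFFLINE counterpart issued at the end of
memory_block_offline().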

I suspect there will be installations where an altmap is undesired: it 
fragments your address space with unmovable (memmap) allocations. 
Currently, runtime allocations of gigantic pages are affected. Long-term 
other large allocations (if we ever see very large THP) will be affected.

For that reason, we want to either support variable-sized memory blocks 
long-term, or simulate that by "grouping" memory blocks that share the 
same altmap located on the first memory blocks in that group: but 
onlining one block forces onlining of the whole group.

On s390x, which adds all memory ahead of time, it's hard to make a 
decision what the right granularity will be, and suddenly changed 
online/offline behavior might be quite "surprising" for users. 
The user can give better hints when adding/removing memory explicitly.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-17 13:56   ` Sumanth Korikkar
@ 2023-11-17 15:37     ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-17 15:37 UTC (permalink / raw)
  To: Sumanth Korikkar
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Aneesh Kumar K.V, Anshuman Khandual, Gerald Schaefer,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 17.11.23 14:56, Sumanth Korikkar wrote:
> On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Hi All,
>>>
>>> The patch series implements "memmap on memory" feature on s390 and
>>> provides the necessary fixes for it.
>>
>> Thinking about this, one thing that makes s390x different from all the other
>> architectures in this series is the altmap handling.
>>
>> I'm curious, why is that even required?
>>
>> A memmap that is not marked as online in the section should not be touched
>> by anybody (except memory onlining code :) ). And if we do, it's usually a
>> BUG because that memmap might contain garbage/be poisoned or completely
>> stale, so we might want to track that down and fix it in any case.
>>
>> So what speaks against just letting add_memory() populate the memmap from
>> the altmap? Then, also the page tables for the memmap are already in place
>> when onlining memory.
>>
> 
> We do have page_init_poison() in sparse_add_section(), which would then
> have to be handled later, not in add_pages().

Was that all, or did you stumble over other things?

-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-17 15:37     ` David Hildenbrand
@ 2023-11-17 19:46       ` Sumanth Korikkar
  2023-11-21 13:13       ` Sumanth Korikkar
  1 sibling, 0 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-17 19:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gerald Schaefer, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
> 
> It might make sense to
> 
> 1) Send the first 3 out separately

Ok sure, I will first send 3 patches as bug fixes with your feedback
applied.

> 2) Look into a simple variant that leaves __add_pages() calls alone and
>    only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers --
>    well, and deals with an inaccessible altmap, like the
>    page_init_poison() when the altmap might be inaccessible.

Thanks for the valuable feedback.

I quickly tried disabling page_init_poison() and removing the hack in
arch_add_memory() and arch_remove_memory(), and also used the new
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. The testing results so
far look promising: everything seems to work, and no issues have been
found.

I will also double-check whether there are any other memmap accesses in
the add_pages() phase.

We will go with this approach for now, i.e. with the notifiers you
suggested and the __add_pages() change.

Do you have any suggestions on how we could check for an inaccessible altmap?

> 3) Look into a proper interface to add/remove memory instead of relying
>    on online/offline.

Agreed, for the long term.
> 
> 2) is certainly an improvement and might be desired in some cases. 3) is
> more powerful (e.g., where you don't want an altmap because of
> fragmentation) and future proof.
> 
> I suspect there will be installations where an altmap is undesired: it
> fragments your address space with unmovable (memmap) allocations. Currently,
> runtime allocations of gigantic pages are affected. Long-term other large
> allocations (if we ever see very large THP) will be affected.
> 
> For that reason, we want to either support variable-sized memory blocks
> long-term, or simulate that by "grouping" memory blocks that share the same
> altmap located on the first memory blocks in that group: but onlining one
> block forces onlining of the whole group.
> 
> On s390x, which adds all memory ahead of time, it's hard to make a decision
> what the right granularity will be, and suddenly changed online/offline
> behavior might be quite "surprising" for users. The user can give better
> hints when adding/removing memory explicitly.

Thanks for providing insights and details.


* Re: [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
  2023-11-14 18:02 ` [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE Sumanth Korikkar
  2023-11-16 18:43   ` David Hildenbrand
@ 2023-11-17 21:39   ` kernel test robot
  1 sibling, 0 replies; 40+ messages in thread
From: kernel test robot @ 2023-11-17 21:39 UTC (permalink / raw)
  To: Sumanth Korikkar, linux-mm, Andrew Morton, David Hildenbrand
  Cc: oe-kbuild-all, Oscar Salvador, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, Gerald Schaefer, Alexander Gordeev,
	Heiko Carstens, Vasily Gorbik, linux-s390, LKML

Hi Sumanth,

kernel test robot noticed the following build errors:

[auto build test ERROR on s390/features]
[also build test ERROR on kvms390/next linus/master v6.7-rc1]
[cannot apply to akpm-mm/mm-everything next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Sumanth-Korikkar/mm-memory_hotplug-fix-memory-hotplug-locking-order/20231115-035455
base:   https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features
patch link:    https://lore.kernel.org/r/20231114180238.1522782-4-sumanthk%40linux.ibm.com
patch subject: [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
config: x86_64-buildonly-randconfig-002-20231118 (https://download.01.org/0day-ci/archive/20231118/202311180545.VeyRXEDq-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231118/202311180545.VeyRXEDq-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311180545.VeyRXEDq-lkp@intel.com/

All errors (new ones prefixed by >>):

   ld: arch/x86/mm/init_64.o: in function `remove_pagetable':
>> init_64.c:(.meminit.text+0xfc7): undefined reference to `vmem_altmap_free'
   ld: mm/memory_hotplug.o: in function `__add_pages':
>> memory_hotplug.c:(.ref.text+0xc01): undefined reference to `vmem_altmap_offset'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  2023-11-17 13:00         ` Gerald Schaefer
@ 2023-11-20 14:49           ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-20 14:49 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Sumanth Korikkar, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

[catching up on mails]

>>> This new approach has the advantage that we do not need to allocate any
>>> additional memory during online phase, neither for direct mapping page
>>> tables nor struct pages, so that memory hotplug can never fail.
>>
>> Right, just like any other architecture that (triggered by whatever
>> mechanism) ends up calling add_memory() and friends.
> 
> Just for better understanding: are the page tables for the identity and
> vmemmap mappings allocated from system memory on other archs? I.e.
> no altmap support for those, only for struct pages (so far)?

Yes, only the actual "memmap ("struct page")" comes from altmap space, 
everything else comes from the buddy during memory hotplug.
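
You can see that split in the vmemmap code; roughly (simplified from
mm/sparse-vmemmap.c, which also tries the early sparse buffer first):

void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node,
					 struct vmem_altmap *altmap)
{
	if (altmap)
		/* carve the memmap out of the hotplugged range itself */
		return altmap_alloc_block_buf(size, altmap);
	/* otherwise, a regular allocation from system memory */
	return vmemmap_alloc_block(size, node);
}

The page tables mapping the vmemmap are allocated the regular way in
either case.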

> 
>>
>>>
>>> The old approach (without altmap) is already a hack, because we add
>>> the memmap / struct pages, but for memory that is not really accessible.
>>
>> Yes, it's disgusting. And you still allocate other things like memory
>> block devices or the identity map.
> 
> I would say it is special :-). And again, for understanding, all other

:)

> things apart from struct pages, still would need to be allocated from
> system memory on other archs?

Yes!

> 
> Of course, struct pages would be by far the biggest part, so having
> altmap support for that helps a lot. Doing the other allocations also
> via altmap would feel natural, but it is not possible yet, or did we
> miss something?

The tricky part is making sure ahead of time that we set aside the 
required number of pageblocks, to properly control during memory 
onlining what to set aside and what to expose to the buddy.

See mhp_supports_memmap_on_memory() / 
memory_block_memmap_on_memory_pages() for the dirty details :)
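
For illustration, roughly the shape of that check (condensed; the real
code in mm/memory_hotplug.c has grown a few more cases):

static bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
	unsigned long remaining_size = size - vmemmap_size;

	/*
	 * The memmap at the start of the block has to cover complete
	 * pageblocks, so the remainder can be handed to the buddy in
	 * pageblock granularity when onlining.
	 */
	return mhp_memmap_on_memory() &&
	       size == memory_block_size_bytes() &&
	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
	       IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);
}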

> 
>>
>>> And with all the disadvantage of pre-allocating struct pages from system
>>> memory.
>>
>> Jep. It never should have been done like that.
> 
> At that time, it gave the benefit over all others, that we do not need
> to allocate struct pages from system memory, at the time of memory online,
> when memory pressure might be high and such allocations might fail.

Agreed. Having the memmap already around can be helpful. But not for all 
standby memory, that's just pure waste.

... but as memory onlining is triggered by user space, it's likely that 
user space cannot even make progress (e.g., start a process to set 
memory online) to actually trigger memory onlining in serious low-memory 
situations.

> 
> I guess you can say that it should have been done "right" at that time,
> e.g. by already adding something like altmap support, instead of our own
> hack.

Probably yes. IMHO, relying on the existing memory block interface was 
the low hanging fruit. Now, s390x is just special.

> 
>>
>>>
>>> The new approach allows to better integrate s390 to the existing
>>> interface, and also make use of altmap support, which would eliminate
>>> the major disadvantage of the old behaviour. So from s390 perspective,
>>> this new mechanism would be preferred, provided that there is no
>>> functional issue with the "added memory blocks without a memmap"
>>> approach.
>>
>> It achieves that by s390x specific hacks in common code :) Instead of
>> everybody else that simply uses add_memory() and friends.
>>
>>>
>>> Do you see any functional issues, e.g. conflict with common
>>> code?
>>
>> I don't see functional issues right now, just the way it is done to
>> implement support for a new feature is a hack IMHO. Replacing hack #1 by
>> hack #2 is not really something reasonable. Let's try to remove hacks.
> 
> Ok, sounds reasonable, let's try that. Introducing some new s390-specific
> interface also feels a bit hacky, or ugly, but we'll see if we find a
> way so that it is only "special" :-)

As proposed in my other mail, I think there are ways to make s390x happy 
first and look into a cleaner approach long-term.

> Reminds me a bit of that "probe" attribute, that also was an arch-specific
> hack initially, IIRC, and is now to be deprecated...

Yeah, that was for interfaces where the kernel has absolutely no clue 
where/what/how memory gets hotplugged. ARM64 without ACPI.

s390x is completely different though: you know exactly which standby 
memory exists, where it resides, in which granularity it can be 
enabled/disabled, ... you don't have to play dangerous "I'm pretty sure 
there is memory out there although nobody can check so I crash the 
kernel" games.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-17 15:37     ` David Hildenbrand
  2023-11-17 19:46       ` Sumanth Korikkar
@ 2023-11-21 13:13       ` Sumanth Korikkar
  2023-11-21 13:21         ` Sumanth Korikkar
  2023-11-21 19:24         ` David Hildenbrand
  1 sibling, 2 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-21 13:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gerald Schaefer, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
> > 
> > Maybe there is also already a common code bug with that, s390 might be
> > special but that is often also good for finding bugs in common code ...
> 
> If it's only the page_init_poison() as noted by Sumanth, we could disable
> that on s390x with an altmap some way or the other; should be possible.
> 
> I mean, you effectively have your own poisoning if the altmap is effectively
> inaccessible and makes your CPU angry on access :)
> 
> Last but not least, support for an inaccessible altmap might come in handy
> for virtio-mem eventually, and make altmap support eventually simpler. So
> added bonus points.

We tried out two possibilities for dealing with vmemmap altmap inaccessibility.
Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory()

diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 075094ca59b4..ab2dfcc7e9e4 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb,
 		 * buddy allocator later.
 		 */
 		__arch_set_page_nodat((void *)__va(start), memory_block->altmap->free);
+		/*
+		 * Poison the struct pages after the memory block is accessible.
+		 * This is only needed with an altmap. Without an altmap, the
+		 * struct pages are poisoned in sparse_add_section().
+		 */
+		if (memory_block->altmap->inaccessible)
+			page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free);
 		break;
 	case MEM_FINISH_OFFLINE:
 		sclp_mem_change_state(start, size, 0);
@@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
 		add_memory(0, addr, block_size,
-			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
+			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY|MHP_ALTMAP_INACCESSIBLE : MHP_NONE);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7d2076583494..5c70707e706f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
  * implies the node id (nid).
  */
 #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
+/*
+ * Mark memmap on memory (struct pages array) as inaccessible during memory
+ * hotplug addition phase.
+ */
+#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))
 
 /*
  * Extended parameters for memory hotplug:
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 744c830f4b13..9837f3e6fb95 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -25,6 +25,7 @@ struct vmem_altmap {
 	unsigned long free;
 	unsigned long align;
 	unsigned long alloc;
+	bool inaccessible;
 };
 
 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a5fc89a8652..d8299853cdcc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1439,6 +1439,8 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
 		if (mhp_supports_memmap_on_memory(size)) {
 			mhp_altmap.free = memory_block_memmap_on_memory_pages();
+			if (mhp_flags & MHP_ALTMAP_INACCESSIBLE)
+				mhp_altmap.inaccessible = true;
 			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
 			if (!params.altmap) {
 				ret = -ENOMEM;
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..3991c717b769 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -907,7 +907,8 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	 * Poison uninitialized struct pages in order to catch invalid flags
 	 * combinations.
 	 */
-	page_init_poison(memmap, sizeof(struct page) * nr_pages);
+	if (!altmap || !altmap->inaccessible)
+		page_init_poison(memmap, sizeof(struct page) * nr_pages);
 
 	ms = __nr_to_section(section_nr);
 	set_section_nid(section_nr, nid);


Approach 2:
===========
Shouldn't the kasan zero shadow mapping be performed first, before
accessing/initializing the memmap via page_init_poison()? If that is
true, then it is a problem for all architectures and could be fixed
like:


diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a5fc89a8652..eb3975740537 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
 	if (ret)
 		return ret;

+	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

 	for (i = 0; i < nr_pages; i++)
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..4ddf53f52075 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	/*
 	 * Poison uninitialized struct pages in order to catch invalid flags
 	 * combinations.
+	 * For altmap, do this later when onlining the memory, as it might
+	 * not be accessible at this point.
 	 */
-	page_init_poison(memmap, sizeof(struct page) * nr_pages);
+	if (!altmap)
+		page_init_poison(memmap, sizeof(struct page) * nr_pages);

 	ms = __nr_to_section(section_nr);
 	set_section_nid(section_nr, nid);



Also, if this approach is taken, should page_init_poison() be performed
with cond_resched() as mentioned in commit d33695b16a9f
("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Opinions?

Thank you


* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-21 13:13       ` Sumanth Korikkar
@ 2023-11-21 13:21         ` Sumanth Korikkar
  2023-11-21 14:42           ` David Hildenbrand
  2023-11-21 19:24         ` David Hildenbrand
  1 sibling, 1 reply; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-21 13:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gerald Schaefer, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote:
> Approach 2:
> ===========
> Shouldn't the kasan zero shadow mapping be performed first, before
> accessing/initializing the memmap via page_init_poison()? If that is
> true, then it is a problem for all architectures and could be fixed
> like:
> 
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7a5fc89a8652..eb3975740537 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>  	if (ret)
>  		return ret;
> 
> +	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
>  	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> 
>  	for (i = 0; i < nr_pages; i++)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..4ddf53f52075 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>  	/*
>  	 * Poison uninitialized struct pages in order to catch invalid flags
>  	 * combinations.
> +	 * For altmap, do this later when onlining the memory, as it might
> +	 * not be accessible at this point.
>  	 */
> -	page_init_poison(memmap, sizeof(struct page) * nr_pages);
> +	if (!altmap)
> +		page_init_poison(memmap, sizeof(struct page) * nr_pages);
> 
>  	ms = __nr_to_section(section_nr);
>  	set_section_nid(section_nr, nid);
> 
> 
> 
> Also, if this approach is taken, should page_init_poison() be performed
> with cond_resched() as mentioned in commit d33695b16a9f
> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Sorry, wrong commit id.

Should page_init_poison() be performed with cond_resched(), as mentioned
in commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup
during pfn range removal")?

Thanks
> 
> Opinions?
> 
> Thank you


* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-21 13:21         ` Sumanth Korikkar
@ 2023-11-21 14:42           ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-11-21 14:42 UTC (permalink / raw)
  To: Sumanth Korikkar
  Cc: Gerald Schaefer, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 21.11.23 14:21, Sumanth Korikkar wrote:
> On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote:
>> Approach 2:
>> ===========
>> Shouldn't the kasan zero shadow mapping be performed first, before
>> accessing/initializing the memmap via page_init_poison()? If that is
>> true, then it is a problem for all architectures and could be fixed
>> like:
>>
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 7a5fc89a8652..eb3975740537 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>>   	if (ret)
>>   		return ret;
>>
>> +	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
>>   	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>>
>>   	for (i = 0; i < nr_pages; i++)
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index 77d91e565045..4ddf53f52075 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>>   	/*
>>   	 * Poison uninitialized struct pages in order to catch invalid flags
>>   	 * combinations.
>> +	 * For altmap, do this later when onlining the memory, as it might
>> +	 * not be accessible at this point.
>>   	 */
>> -	page_init_poison(memmap, sizeof(struct page) * nr_pages);
>> +	if (!altmap)
>> +		page_init_poison(memmap, sizeof(struct page) * nr_pages);
>>
>>   	ms = __nr_to_section(section_nr);
>>   	set_section_nid(section_nr, nid);
>>
>>
>>
>> Also, if this approach is taken, should page_init_poison() be performed
>> with cond_resched() as mentioned in commit d33695b16a9f
>> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?
> 
> Sorry, wrong commit id.
> 
> Should page_init_poison() be performed with cond_resched(), as mentioned
> in commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup
> during pfn range removal")?

I think people are currently looking into removing all those cond_resched() calls:

https://lore.kernel.org/all/20231107230822.371443-29-ankur.a.arora@oracle.com/T/#mda52da685a142bec9607625386b0b660e5470abe
-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-21 13:13       ` Sumanth Korikkar
  2023-11-21 13:21         ` Sumanth Korikkar
@ 2023-11-21 19:24         ` David Hildenbrand
  2023-11-22 11:44           ` Sumanth Korikkar
  1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2023-11-21 19:24 UTC (permalink / raw)
  To: Sumanth Korikkar
  Cc: Gerald Schaefer, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On 21.11.23 14:13, Sumanth Korikkar wrote:
> On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
>>>
>>> Maybe there is also already a common code bug with that, s390 might be
>>> special but that is often also good for finding bugs in common code ...
>>
>> If it's only the page_init_poison() as noted by Sumanth, we could disable
>> that on s390x with an altmap some way or the other; should be possible.
>>
>> I mean, you effectively have your own poisoning if the altmap is effectively
>> inaccessible and makes your CPU angry on access :)
>>
>> Last but not least, support for an inaccessible altmap might come in handy
>> for virtio-mem eventually, and make altmap support eventually simpler. So
>> added bonus points.
> 
> We tried out two possibilities for dealing with vmemmap altmap inaccessibility.
> Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory()
> 
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index 075094ca59b4..ab2dfcc7e9e4 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb,
>   		 * buddy allocator later.
>   		 */
>   		__arch_set_page_nodat((void *)__va(start), memory_block->altmap->free);
> +		/*
> +		 * Poison the struct pages after the memory block is accessible.
> +		 * This is only needed with an altmap. Without an altmap, the
> +		 * struct pages are poisoned in sparse_add_section().
> +		 */
> +		if (memory_block->altmap->inaccessible)
> +			page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free);

See below; likely it's best if the core does that.

>   		break;
>   	case MEM_FINISH_OFFLINE:
>   		sclp_mem_change_state(start, size, 0);
> @@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn)
>   		goto skip_add;
>   	for (addr = start; addr < start + size; addr += block_size)
>   		add_memory(0, addr, block_size,
> -			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
> +			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY|MHP_ALTMAP_INACCESSIBLE : MHP_NONE);
>   skip_add:
>   	first_rn = rn;
>   	num = 1;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 7d2076583494..5c70707e706f 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
>    * implies the node id (nid).
>    */
>   #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
> +/*
> + * Mark memmap on memory (struct pages array) as inaccessible during memory
> + * hotplug addition phase.
> + */
> +#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))

If we go that path, maybe rather call it MHP_OFFLINE_INACCESSIBLE and 
document how this relates to/interacts with the memory notifier 
callbacks and the altmap.

Then, we can logically abstract this from altmap handling. Simply, the 
memory should not be read/written before the memory notifier was called.

In the core, you can do the poisoning for the altmap in that case after 
calling the notifier, probably in mhp_init_memmap_on_memory() as you do 
below.

>   
>   /*
>    * Extended parameters for memory hotplug:
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 744c830f4b13..9837f3e6fb95 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -25,6 +25,7 @@ struct vmem_altmap {
>   	unsigned long free;
>   	unsigned long align;
>   	unsigned long alloc;
> +	bool inaccessible;

We should then likely remember that information for the memory block, 
not the altmap.

[...]

> 
> 
> Approach 2:
> ===========
> Shouldn't the kasan zero shadow mapping be performed first, before
> accessing/initializing the memmap via page_init_poison()? If that is

Likely the existing way. The first access to the poisoned memmap should 
be a write, not a read. But I didn't look into the details.

Can you try reproducing?

> true, then it is a problem for all architectures and could be
> fixed like:
> 
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7a5fc89a8652..eb3975740537 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>   	if (ret)
>   		return ret;
> 
> +	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
>   	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> 
>   	for (i = 0; i < nr_pages; i++)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..4ddf53f52075 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>   	/*
>   	 * Poison uninitialized struct pages in order to catch invalid flags
>   	 * combinations.
> +	 * For altmap, do this later when onlining the memory, as it might
> +	 * not be accessible at this point.
>   	 */
> -	page_init_poison(memmap, sizeof(struct page) * nr_pages);
> +	if (!altmap)
> +		page_init_poison(memmap, sizeof(struct page) * nr_pages);
> 
>   	ms = __nr_to_section(section_nr);
>   	set_section_nid(section_nr, nid);
> 

That's too generic when it comes to other altmap users, especially DAX 
or when the altmap is accessible while memory is offlining, and we want 
to catch that.

> 
> 
> Also, if this approach is taken, should page_init_poison() be performed
> with cond_resched() as mentioned in commit d33695b16a9f
> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Likely as soon as possible after it is accessible :)


-- 
Cheers,

David / dhildenb



* Re: [PATCH 0/8] implement "memmap on memory" feature on s390
  2023-11-21 19:24         ` David Hildenbrand
@ 2023-11-22 11:44           ` Sumanth Korikkar
  0 siblings, 0 replies; 40+ messages in thread
From: Sumanth Korikkar @ 2023-11-22 11:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gerald Schaefer, linux-mm, Andrew Morton, Oscar Salvador,
	Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390,
	LKML

On Tue, Nov 21, 2023 at 08:24:48PM +0100, David Hildenbrand wrote:
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 7d2076583494..5c70707e706f 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
> >    * implies the node id (nid).
> >    */
> >   #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
> > +/*
> > + * Mark memmap on memory (struct pages array) as inaccessible during memory
> > + * hotplug addition phase.
> > + */
> > +#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))
> 
> If we go that path, maybe rather call it MHP_OFFLINE_INACCESSIBLE and document
> how this relates to/interacts with the memory notifier callbacks and the altmap.
> 
> Then, we can logically abstract this from altmap handling. Simply, the
> memory should not be read/written before the memory notifier was called.
> 
> In the core, you can do the poisoning for the altmap in that case after
> calling the notifier, probably in mhp_init_memmap_on_memory() as you do
> below.
Ok, sure. Thanks.
> 
> >   /*
> >    * Extended parameters for memory hotplug:
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 744c830f4b13..9837f3e6fb95 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -25,6 +25,7 @@ struct vmem_altmap {
> >   	unsigned long free;
> >   	unsigned long align;
> >   	unsigned long alloc;
> > +	bool inaccessible;
> 
> We should then likely remember that information for the memory block, not
> the altmap.

I tried using an inaccessible field in memory_block and observed that
the memory block is only created after arch_add_memory() succeeds.
Hence, the checks for inaccessible memory in sparse_add_section()
(regarding page poisoning) still have to rely on the mhp_flags conveyed
through add_memory(). The memory_block inaccessible state is then set
later, in create_memory_block_devices().

I hope this approach is fine.

On the other hand, relying on an inaccessible flag in vmem_altmap could
make the checks in arch_add_memory() and other functions easier, I
suppose.
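
For illustration, remembering it on the block could look roughly like
this (hypothetical sketch, exact placement to be decided):

/* include/linux/memory.h */
struct memory_block {
	/* ... existing fields ... */
	struct vmem_altmap *altmap;
	/*
	 * Set in create_memory_block_devices() based on the MHP flags;
	 * the range must not be read or written before the
	 * MEM_PREPARE_ONLINE notifier has made it accessible.
	 */
	bool inaccessible;
};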

> 
> [...]
> 
> > 
> > 
> > Approach 2:
> > ===========
> > Shouldnt kasan zero shadow mapping performed first before
> > accessing/initializing memmap via page_init_poisining()?  If that is
> 
> Likely the existing way. The first access to the poisoned memmap should be a
> write, not a read. But I didn't look into the details.
> 
> Can you try reproducing?
>

Executing page_init_poison() right before kasan_add_zero_shadow() in the
context of altmap usage did not result in a crash when I attempted to
reproduce it.

However, with altmap + page_init_poison() within sparse_add_section(), a
crash occurs on our arch, as previously discussed in this thread.

Thank you


end of thread, other threads:[~2023-11-22 11:45 UTC | newest]

Thread overview: 40+ messages
2023-11-14 18:02 [PATCH 0/8] implement "memmap on memory" feature on s390 Sumanth Korikkar
2023-11-14 18:02 ` [PATCH 1/8] mm/memory_hotplug: fix memory hotplug locking order Sumanth Korikkar
2023-11-14 18:22   ` David Hildenbrand
2023-11-15 13:41     ` Sumanth Korikkar
2023-11-16 18:40       ` David Hildenbrand
2023-11-17 13:42         ` Sumanth Korikkar
2023-11-14 18:02 ` [PATCH 2/8] mm/memory_hotplug: fix error handling in add_memory_resource() Sumanth Korikkar
2023-11-14 18:36   ` David Hildenbrand
2023-11-15 13:45     ` Sumanth Korikkar
2023-11-14 18:02 ` [PATCH 3/8] mm: use vmem_altmap code without CONFIG_ZONE_DEVICE Sumanth Korikkar
2023-11-16 18:43   ` David Hildenbrand
2023-11-17 21:39   ` kernel test robot
2023-11-14 18:02 ` [PATCH 4/8] mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers Sumanth Korikkar
2023-11-14 18:27   ` David Hildenbrand
2023-11-15 14:23     ` Sumanth Korikkar
2023-11-16 19:03       ` David Hildenbrand
2023-11-15 15:03     ` Gerald Schaefer
2023-11-16 19:02       ` David Hildenbrand
2023-11-14 18:02 ` [PATCH 5/8] s390/mm: allocate vmemmap pages from self-contained memory range Sumanth Korikkar
2023-11-14 18:02 ` [PATCH 6/8] s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers Sumanth Korikkar
2023-11-14 18:39   ` David Hildenbrand
2023-11-15 14:20     ` Sumanth Korikkar
2023-11-16 19:16       ` David Hildenbrand
2023-11-16 19:23         ` David Hildenbrand
2023-11-17 13:00         ` Gerald Schaefer
2023-11-20 14:49           ` David Hildenbrand
2023-11-14 18:02 ` [PATCH 7/8] s390/sclp: remove unhandled memory notifier type Sumanth Korikkar
2023-11-16 19:33   ` David Hildenbrand
2023-11-14 18:02 ` [PATCH 8/8] s390: enable MHP_MEMMAP_ON_MEMORY Sumanth Korikkar
2023-11-16 23:08 ` [PATCH 0/8] implement "memmap on memory" feature on s390 David Hildenbrand
2023-11-17 13:00   ` Gerald Schaefer
2023-11-17 15:37     ` David Hildenbrand
2023-11-17 19:46       ` Sumanth Korikkar
2023-11-21 13:13       ` Sumanth Korikkar
2023-11-21 13:21         ` Sumanth Korikkar
2023-11-21 14:42           ` David Hildenbrand
2023-11-21 19:24         ` David Hildenbrand
2023-11-22 11:44           ` Sumanth Korikkar
2023-11-17 13:56   ` Sumanth Korikkar
2023-11-17 15:37     ` David Hildenbrand
