[PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem
@ 2023-07-20  7:14 Vishal Verma
  2023-07-20  7:14 ` [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Vishal Verma @ 2023-07-20  7:14 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Oscar Salvador, Dan Williams,
	Dave Jiang
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Aneesh Kumar K.V, Jonathan Cameron, Jeff Moyer,
	Vishal Verma

The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'. There is a chance there isn't enough regular system memory
available to fit the memmap for this new memory. It's therefore
desirable, if all other conditions are met, for the kmem managed memory
to place its memmap on the newly added memory itself.

The main hurdle for accomplishing this for kmem is that memmap_on_memory
can only be done if the memory being added is equal to the size of one
memblock. To overcome this,allow the hotplug code to split an add_memory()
request into memblock-sized chunks, and try_remove_memory() to also
expect and handle such a scenario.

Patch 1 exports mhp_supports_memmap_on_memory() so it can be used by the
kmem driver.

Patch 2 teaches the memory_hotplug code to allow for splitting
add_memory() and remove_memory() requests over memblock sized chunks.

Patch 3 adds a sysfs control for the kmem driver that would
allow an opt-out of using memmap_on_memory for the memory being added.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
Changes in v2:
- Drop the patch to create an override path for the memmap_on_memory
  module param (David)
- Move the chunking into memory_hotplug.c so that any caller of
  add_memory() can request this behavior. (David)
- Handle remove_memory() too. (David, Ying)
- Add a sysfs control in the kmem driver for memmap_on_memory semantics
  (David, Jonathan)
- Add a #else case to define mhp_supports_memmap_on_memory() if
  CONFIG_MEMORY_HOTPLUG is unset. (0day report)
- Link to v1: https://lore.kernel.org/r/20230613-vv-kmem_memmap-v1-0-f6de9c6af2c6@intel.com

---
Vishal Verma (3):
      mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
      mm/memory_hotplug: split memmap_on_memory requests across memblocks
      dax/kmem: allow kmem to add memory with memmap_on_memory

 include/linux/memory_hotplug.h |   5 ++
 drivers/dax/dax-private.h      |   1 +
 drivers/dax/bus.c              |  48 +++++++++++++
 drivers/dax/kmem.c             |   7 +-
 mm/memory_hotplug.c            | 155 ++++++++++++++++++++++++-----------------
 5 files changed, 152 insertions(+), 64 deletions(-)
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230613-vv-kmem_memmap-5483c8d04279

Best regards,
-- 
Vishal Verma <vishal.l.verma@intel.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
  2023-07-20  7:14 [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
@ 2023-07-20  7:14 ` Vishal Verma
  2023-07-24  6:00   ` Huang, Ying
  2023-07-20  7:14 ` [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks Vishal Verma
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Vishal Verma @ 2023-07-20  7:14 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Oscar Salvador, Dan Williams,
	Dave Jiang
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Aneesh Kumar K.V, Jonathan Cameron, Jeff Moyer,
	Vishal Verma

In preparation for dax drivers, which can be built as modules,
to use this interface, export it with EXPORT_SYMBOL_GPL(). Add a #else
case for the symbol for builds without CONFIG_MEMORY_HOTPLUG.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 include/linux/memory_hotplug.h | 5 +++++
 mm/memory_hotplug.c            | 1 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 013c69753c91..fc5da07ad011 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -355,6 +355,11 @@ extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
 extern bool mhp_supports_memmap_on_memory(unsigned long size);
+#else
+static inline bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+	return false;
+}
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3f231cf1b410..e9bcacbcbae2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1284,6 +1284,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
 	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
 }
+EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
 
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug

-- 
2.41.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-07-20  7:14 [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
  2023-07-20  7:14 ` [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
@ 2023-07-20  7:14 ` Vishal Verma
  2023-07-21 12:20   ` Aneesh Kumar K.V
                     ` (2 more replies)
  2023-07-20  7:14 ` [PATCH v2 3/3] dax/kmem: allow kmem to add memory with memmap_on_memory Vishal Verma
  2023-07-25 15:54 ` [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
  3 siblings, 3 replies; 15+ messages in thread
From: Vishal Verma @ 2023-07-20  7:14 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Oscar Salvador, Dan Williams,
	Dave Jiang
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Aneesh Kumar K.V, Jonathan Cameron, Jeff Moyer,
	Vishal Verma

The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is currently
restricted to 'memblock_size' chunks of memory being added. Adding a
larger span of memory precludes memmap_on_memory semantics.

For users of hotplug such as kmem, large amounts of memory might get
added from the CXL subsystem. In some cases, this amount may exceed the
available 'main memory' to store the memmap for the memory being added.
In this case, it is useful to have a way to place the memmap on the
memory being added, even if it means splitting the addition into
memblock-sized chunks.

Change add_memory_resource() to loop over memblock-sized chunks of
memory if caller requested memmap_on_memory, and if other conditions for
it are met,. Teach try_remove_memory() to also expect that a memory
range being removed might have been split up into memblock sized chunks,
and to loop through those as needed.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 mm/memory_hotplug.c | 154 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 91 insertions(+), 63 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e9bcacbcbae2..20456f0d28e6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1286,6 +1286,35 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 }
 EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
 
+static int add_memory_create_devices(int nid, struct memory_group *group,
+				     u64 start, u64 size, mhp_t mhp_flags)
+{
+	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	struct vmem_altmap mhp_altmap = {};
+	int ret;
+
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
+		mhp_altmap.free = PHYS_PFN(size);
+		mhp_altmap.base_pfn = PHYS_PFN(start);
+		params.altmap = &mhp_altmap;
+	}
+
+	/* call arch's memory hotadd */
+	ret = arch_add_memory(nid, start, size, &params);
+	if (ret < 0)
+		return ret;
+
+	/* create memory block devices after memory was added */
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
+					  group);
+	if (ret) {
+		arch_remove_memory(start, size, NULL);
+		return ret;
+	}
+
+	return 0;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1294,11 +1323,10 @@ EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
  */
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
-	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	unsigned long memblock_size = memory_block_size_bytes();
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {};
 	struct memory_group *group = NULL;
-	u64 start, size;
+	u64 start, size, cur_start;
 	bool new_node = false;
 	int ret;
 
@@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	/*
 	 * Self hosted memmap array
 	 */
-	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-		if (!mhp_supports_memmap_on_memory(size)) {
-			ret = -EINVAL;
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+	    mhp_supports_memmap_on_memory(memblock_size)) {
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size) {
+			ret = add_memory_create_devices(nid, group, cur_start,
+							memblock_size,
+							mhp_flags);
+			if (ret)
+				goto error;
+		}
+	} else {
+		ret = add_memory_create_devices(nid, group, start, size, mhp_flags);
+		if (ret)
 			goto error;
-		}
-		mhp_altmap.free = PHYS_PFN(size);
-		mhp_altmap.base_pfn = PHYS_PFN(start);
-		params.altmap = &mhp_altmap;
-	}
-
-	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, &params);
-	if (ret < 0)
-		goto error;
-
-	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
-					  group);
-	if (ret) {
-		arch_remove_memory(start, size, NULL);
-		goto error;
 	}
 
 	if (new_node) {
@@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);
 
-static int __ref try_remove_memory(u64 start, u64 size)
+static void __ref __try_remove_memory(int nid, u64 start, u64 size,
+				     struct vmem_altmap *altmap)
 {
-	struct vmem_altmap mhp_altmap = {};
-	struct vmem_altmap *altmap = NULL;
-	unsigned long nr_vmemmap_pages;
-	int rc = 0, nid = NUMA_NO_NODE;
+	/* remove memmap entry */
+	firmware_map_remove(start, start + size, "System RAM");
+
+	/*
+	 * Memory block device removal under the device_hotplug_lock is
+	 * a barrier against racing online attempts.
+	 */
+	remove_memory_block_devices(start, size);
+
+	mem_hotplug_begin();
+
+	arch_remove_memory(start, size, altmap);
+
+	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
+		memblock_phys_free(start, size);
+		memblock_remove(start, size);
+	}
+
+	release_mem_region_adjustable(start, size);
+
+	if (nid != NUMA_NO_NODE)
+		try_offline_node(nid);
+
+	mem_hotplug_done();
+}
+
+static int try_remove_memory(u64 start, u64 size)
+{
+	int rc, nid = NUMA_NO_NODE;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -2058,20 +2105,21 @@ static int __ref try_remove_memory(u64 start, u64 size)
 		return rc;
 
 	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap
 	 */
 	if (mhp_memmap_on_memory()) {
-		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
-						      get_nr_vmemmap_pages_cb);
-		if (nr_vmemmap_pages) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					start, start + size);
-				return -EINVAL;
-			}
+		unsigned long memblock_size = memory_block_size_bytes();
+		struct vmem_altmap mhp_altmap = {};
+		struct vmem_altmap *altmap;
+		u64 cur_start;
 
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size) {
+			unsigned long nr_vmemmap_pages =
+				walk_memory_blocks(start, memblock_size, NULL,
+						   get_nr_vmemmap_pages_cb);
 			/*
 			 * Let remove_pmd_table->free_hugepage_table do the
 			 * right thing if we used vmem_altmap when hot-adding
@@ -2079,33 +2127,13 @@ static int __ref try_remove_memory(u64 start, u64 size)
 			 */
 			mhp_altmap.alloc = nr_vmemmap_pages;
 			altmap = &mhp_altmap;
+			__try_remove_memory(nid, cur_start, memblock_size,
+						 altmap);
 		}
+	} else {
+		__try_remove_memory(nid, start, size, NULL);
 	}
 
-	/* remove memmap entry */
-	firmware_map_remove(start, start + size, "System RAM");
-
-	/*
-	 * Memory block device removal under the device_hotplug_lock is
-	 * a barrier against racing online attempts.
-	 */
-	remove_memory_block_devices(start, size);
-
-	mem_hotplug_begin();
-
-	arch_remove_memory(start, size, altmap);
-
-	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-		memblock_phys_free(start, size);
-		memblock_remove(start, size);
-	}
-
-	release_mem_region_adjustable(start, size);
-
-	if (nid != NUMA_NO_NODE)
-		try_offline_node(nid);
-
-	mem_hotplug_done();
 	return 0;
 }
 

-- 
2.41.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/3] dax/kmem: allow kmem to add memory with memmap_on_memory
  2023-07-20  7:14 [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
  2023-07-20  7:14 ` [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
  2023-07-20  7:14 ` [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks Vishal Verma
@ 2023-07-20  7:14 ` Vishal Verma
  2023-07-25 15:54 ` [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
  3 siblings, 0 replies; 15+ messages in thread
From: Vishal Verma @ 2023-07-20  7:14 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Oscar Salvador, Dan Williams,
	Dave Jiang
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Aneesh Kumar K.V, Jonathan Cameron, Jeff Moyer,
	Vishal Verma

Large amounts of memory managed by the kmem driver may come in via CXL,
and it is often desirable to have the memmap for this memory on the new
memory itself.

Enroll kmem-managed memory for memmap_on_memory semantics as a default
if other requirements for it are met. Add a sysfs override under the dax
device to opt out of this behavior.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 drivers/dax/dax-private.h |  1 +
 drivers/dax/bus.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/dax/kmem.c        |  7 ++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 27cf2daaaa79..446617b73aea 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -70,6 +70,7 @@ struct dev_dax {
 	struct ida ida;
 	struct device dev;
 	struct dev_pagemap *pgmap;
+	bool memmap_on_memory;
 	int nr_range;
 	struct dev_dax_range {
 		unsigned long pgoff;
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 0ee96e6fc426..c8e3ea7c674d 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */
+#include <linux/memory_hotplug.h>
 #include <linux/memremap.h>
+#include <linux/memory.h>
 #include <linux/device.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
@@ -1269,6 +1271,43 @@ static ssize_t numa_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(numa_node);
 
+static ssize_t memmap_on_memory_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
+}
+
+static ssize_t memmap_on_memory_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t len)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
+	ssize_t rc;
+	bool val;
+
+	rc = kstrtobool(buf, &val);
+	if (rc)
+		return rc;
+
+	device_lock(dax_region->dev);
+	if (!dax_region->dev->driver) {
+		device_unlock(dax_region->dev);
+		return -ENXIO;
+	}
+
+	if (mhp_supports_memmap_on_memory(memory_block_size_bytes()))
+		dev_dax->memmap_on_memory = val;
+	else
+		rc = -ENXIO;
+
+	device_unlock(dax_region->dev);
+	return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_RW(memmap_on_memory);
+
 static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
@@ -1295,6 +1334,7 @@ static struct attribute *dev_dax_attributes[] = {
 	&dev_attr_align.attr,
 	&dev_attr_resource.attr,
 	&dev_attr_numa_node.attr,
+	&dev_attr_memmap_on_memory.attr,
 	NULL,
 };
 
@@ -1400,6 +1440,14 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 	dev_dax->align = dax_region->align;
 	ida_init(&dev_dax->ida);
 
+	/*
+	 * If supported by memory_hotplug, allow memmap_on_memory behavior by
+	 * default. This can be overridden via sysfs before handing the memory
+	 * over to kmem if desired.
+	 */
+	if (mhp_supports_memmap_on_memory(memory_block_size_bytes()))
+		dev_dax->memmap_on_memory = true;
+
 	inode = dax_inode(dax_dev);
 	dev->devt = inode->i_rdev;
 	dev->bus = &dax_bus_type;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 898ca9505754..e6976a79093d 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -56,6 +56,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	unsigned long total_len = 0;
 	struct dax_kmem_data *data;
 	int i, rc, mapped = 0;
+	mhp_t mhp_flags;
 	int numa_node;
 
 	/*
@@ -136,12 +137,16 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		 */
 		res->flags = IORESOURCE_SYSTEM_RAM;
 
+		mhp_flags = MHP_NID_IS_MGID;
+		if (dev_dax->memmap_on_memory)
+			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+
 		/*
 		 * Ensure that future kexec'd kernels will not treat
 		 * this as RAM automatically.
 		 */
 		rc = add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, MHP_NID_IS_MGID);
+				range_len(&range), kmem_name, mhp_flags);
 
 		if (rc) {
 			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",

-- 
2.41.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-07-20  7:14 ` [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks Vishal Verma
@ 2023-07-21 12:20   ` Aneesh Kumar K.V
  2023-07-23 14:53   ` Aneesh Kumar K.V
  2023-07-24  5:54   ` Huang, Ying
  2 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-21 12:20 UTC (permalink / raw)
  To: Vishal Verma, Andrew Morton, David Hildenbrand, Oscar Salvador,
	Dan Williams, Dave Jiang
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Jonathan Cameron, Jeff Moyer, Vishal Verma

Vishal Verma <vishal.l.verma@intel.com> writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is currently
> restricted to 'memblock_size' chunks of memory being added. Adding a
> larger span of memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met,. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>

This conflicts with https://lore.kernel.org/linux-mm/20230718024409.95742-1-aneesh.kumar@linux.ibm.com/
IIUC Andrew was planning add that series to -mm. Also that patchset makes
some of related changes in this patch not required. Can you rebase this
series on top of that ? 


>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 154 +++++++++++++++++++++++++++++++---------------------
>  1 file changed, 91 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9bcacbcbae2..20456f0d28e6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1286,6 +1286,35 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  }
>  EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +				     u64 start, u64 size, mhp_t mhp_flags)
> +{
> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	struct vmem_altmap mhp_altmap = {};
> +	int ret;
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> +		mhp_altmap.free = PHYS_PFN(size);
> +		mhp_altmap.base_pfn = PHYS_PFN(start);
> +		params.altmap = &mhp_altmap;
> +	}
> +
> +	/* call arch's memory hotadd */
> +	ret = arch_add_memory(nid, start, size, &params);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* create memory block devices after memory was added */
> +	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
> +					  group);
> +	if (ret) {
> +		arch_remove_memory(start, size, NULL);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1294,11 +1323,10 @@ EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> -	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	unsigned long memblock_size = memory_block_size_bytes();
>  	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> -	struct vmem_altmap mhp_altmap = {};
>  	struct memory_group *group = NULL;
> -	u64 start, size;
> +	u64 start, size, cur_start;
>  	bool new_node = false;
>  	int ret;
>  
> @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	/*
>  	 * Self hosted memmap array
>  	 */
> -	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> -		if (!mhp_supports_memmap_on_memory(size)) {
> -			ret = -EINVAL;
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> +	    mhp_supports_memmap_on_memory(memblock_size)) {
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			ret = add_memory_create_devices(nid, group, cur_start,
> +							memblock_size,
> +							mhp_flags);
> +			if (ret)
> +				goto error;
> +		}
> +	} else {
> +		ret = add_memory_create_devices(nid, group, start, size, mhp_flags);
> +		if (ret)
>  			goto error;
> -		}
> -		mhp_altmap.free = PHYS_PFN(size);
> -		mhp_altmap.base_pfn = PHYS_PFN(start);
> -		params.altmap = &mhp_altmap;
> -	}
> -
> -	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, &params);
> -	if (ret < 0)
> -		goto error;
> -
> -	/* create memory block devices after memory was added */
> -	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
> -					  group);
> -	if (ret) {
> -		arch_remove_memory(start, size, NULL);
> -		goto error;
>  	}
>  
>  	if (new_node) {
> @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>  
> -static int __ref try_remove_memory(u64 start, u64 size)
> +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
> +				     struct vmem_altmap *altmap)
>  {
> -	struct vmem_altmap mhp_altmap = {};
> -	struct vmem_altmap *altmap = NULL;
> -	unsigned long nr_vmemmap_pages;
> -	int rc = 0, nid = NUMA_NO_NODE;
> +	/* remove memmap entry */
> +	firmware_map_remove(start, start + size, "System RAM");
> +
> +	/*
> +	 * Memory block device removal under the device_hotplug_lock is
> +	 * a barrier against racing online attempts.
> +	 */
> +	remove_memory_block_devices(start, size);
> +
> +	mem_hotplug_begin();
> +
> +	arch_remove_memory(start, size, altmap);
> +
> +	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> +		memblock_phys_free(start, size);
> +		memblock_remove(start, size);
> +	}
> +
> +	release_mem_region_adjustable(start, size);
> +
> +	if (nid != NUMA_NO_NODE)
> +		try_offline_node(nid);
> +
> +	mem_hotplug_done();
> +}
> +
> +static int try_remove_memory(u64 start, u64 size)
> +{
> +	int rc, nid = NUMA_NO_NODE;
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -2058,20 +2105,21 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  		return rc;
>  
>  	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> +	 * For memmap_on_memory, the altmaps could have been added on
> +	 * a per-memblock basis. Loop through the entire range if so,
> +	 * and remove each memblock and its altmap
>  	 */
>  	if (mhp_memmap_on_memory()) {
> -		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> -						      get_nr_vmemmap_pages_cb);
> -		if (nr_vmemmap_pages) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> +		unsigned long memblock_size = memory_block_size_bytes();
> +		struct vmem_altmap mhp_altmap = {};
> +		struct vmem_altmap *altmap;
> +		u64 cur_start;
>  
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			unsigned long nr_vmemmap_pages =
> +				walk_memory_blocks(start, memblock_size, NULL,
> +						   get_nr_vmemmap_pages_cb);
>  			/*
>  			 * Let remove_pmd_table->free_hugepage_table do the
>  			 * right thing if we used vmem_altmap when hot-adding
> @@ -2079,33 +2127,13 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  			 */
>  			mhp_altmap.alloc = nr_vmemmap_pages;
>  			altmap = &mhp_altmap;
> +			__try_remove_memory(nid, cur_start, memblock_size,
> +						 altmap);
>  		}
> +	} else {
> +		__try_remove_memory(nid, start, size, NULL);
>  	}
>  
> -	/* remove memmap entry */
> -	firmware_map_remove(start, start + size, "System RAM");
> -
> -	/*
> -	 * Memory block device removal under the device_hotplug_lock is
> -	 * a barrier against racing online attempts.
> -	 */
> -	remove_memory_block_devices(start, size);
> -
> -	mem_hotplug_begin();
> -
> -	arch_remove_memory(start, size, altmap);
> -
> -	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> -		memblock_phys_free(start, size);
> -		memblock_remove(start, size);
> -	}
> -
> -	release_mem_region_adjustable(start, size);
> -
> -	if (nid != NUMA_NO_NODE)
> -		try_offline_node(nid);
> -
> -	mem_hotplug_done();
>  	return 0;
>  }
>  
>
> -- 
> 2.41.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-07-20  7:14 ` [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks Vishal Verma
  2023-07-21 12:20   ` Aneesh Kumar K.V
@ 2023-07-23 14:53   ` Aneesh Kumar K.V
  2023-07-24  3:16     ` Huang, Ying
  2023-07-24  5:54   ` Huang, Ying
  2 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-23 14:53 UTC (permalink / raw)
  To: Vishal Verma, Andrew Morton, David Hildenbrand, Oscar Salvador,
	Dan Williams, Dave Jiang
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Jonathan Cameron, Jeff Moyer, Vishal Verma

Vishal Verma <vishal.l.verma@intel.com> writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is currently
> restricted to 'memblock_size' chunks of memory being added. Adding a
> larger span of memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met,. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 154 +++++++++++++++++++++++++++++++---------------------
>  1 file changed, 91 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9bcacbcbae2..20456f0d28e6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1286,6 +1286,35 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  }
>  EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +				     u64 start, u64 size, mhp_t mhp_flags)
> +{
> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	struct vmem_altmap mhp_altmap = {};
> +	int ret;
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> +		mhp_altmap.free = PHYS_PFN(size);
> +		mhp_altmap.base_pfn = PHYS_PFN(start);
> +		params.altmap = &mhp_altmap;
> +	}
> +
> +	/* call arch's memory hotadd */
> +	ret = arch_add_memory(nid, start, size, &params);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* create memory block devices after memory was added */
> +	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
> +					  group);
> +	if (ret) {
> +		arch_remove_memory(start, size, NULL);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1294,11 +1323,10 @@ EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> -	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	unsigned long memblock_size = memory_block_size_bytes();
>  	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> -	struct vmem_altmap mhp_altmap = {};
>  	struct memory_group *group = NULL;
> -	u64 start, size;
> +	u64 start, size, cur_start;
>  	bool new_node = false;
>  	int ret;
>  
> @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	/*
>  	 * Self hosted memmap array
>  	 */
> -	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> -		if (!mhp_supports_memmap_on_memory(size)) {
> -			ret = -EINVAL;
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> +	    mhp_supports_memmap_on_memory(memblock_size)) {
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			ret = add_memory_create_devices(nid, group, cur_start,
> +							memblock_size,
> +							mhp_flags);
> +			if (ret)
> +				goto error;
> +		}

We should handle the below error details here. 

1) If we hit an error after some blocks got added, should we iterate over rest of the dev_dax->nr_range.
2) With some blocks added if we return a failure here, we remove the
resource in dax_kmem. Is that ok? 

IMHO error handling with partial creation of memory blocks in a resource range should be
documented with this change.


> +	} else {
> +		ret = add_memory_create_devices(nid, group, start, size, mhp_flags);
> +		if (ret)
>  			goto error;
> -		}
> -		mhp_altmap.free = PHYS_PFN(size);
> -		mhp_altmap.base_pfn = PHYS_PFN(start);
> -		params.altmap = &mhp_altmap;
> -	}
> -
> -	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, &params);
> -	if (ret < 0)
> -		goto error;
> -
> -	/* create memory block devices after memory was added */
> -	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
> -					  group);
> -	if (ret) {
> -		arch_remove_memory(start, size, NULL);
> -		goto error;
>  	}
>  
>  	if (new_node) {
> @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>  
> -static int __ref try_remove_memory(u64 start, u64 size)
> +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
> +				     struct vmem_altmap *altmap)
>  {
> -	struct vmem_altmap mhp_altmap = {};
> -	struct vmem_altmap *altmap = NULL;
> -	unsigned long nr_vmemmap_pages;
> -	int rc = 0, nid = NUMA_NO_NODE;
> +	/* remove memmap entry */
> +	firmware_map_remove(start, start + size, "System RAM");
> +
> +	/*
> +	 * Memory block device removal under the device_hotplug_lock is
> +	 * a barrier against racing online attempts.
> +	 */
> +	remove_memory_block_devices(start, size);
> +
> +	mem_hotplug_begin();
> +
> +	arch_remove_memory(start, size, altmap);
> +
> +	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> +		memblock_phys_free(start, size);
> +		memblock_remove(start, size);
> +	}
> +
> +	release_mem_region_adjustable(start, size);
> +
> +	if (nid != NUMA_NO_NODE)
> +		try_offline_node(nid);
> +
> +	mem_hotplug_done();
> +}
> +
> +static int try_remove_memory(u64 start, u64 size)
> +{
> +	int rc, nid = NUMA_NO_NODE;
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -2058,20 +2105,21 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  		return rc;
>  
>  	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> +	 * For memmap_on_memory, the altmaps could have been added on
> +	 * a per-memblock basis. Loop through the entire range if so,
> +	 * and remove each memblock and its altmap
>  	 */
>  	if (mhp_memmap_on_memory()) {
> -		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> -						      get_nr_vmemmap_pages_cb);
> -		if (nr_vmemmap_pages) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> +		unsigned long memblock_size = memory_block_size_bytes();
> +		struct vmem_altmap mhp_altmap = {};
> +		struct vmem_altmap *altmap;
> +		u64 cur_start;
>  
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			unsigned long nr_vmemmap_pages =
> +				walk_memory_blocks(start, memblock_size, NULL,
> +						   get_nr_vmemmap_pages_cb);
>  			/*
>  			 * Let remove_pmd_table->free_hugepage_table do the
>  			 * right thing if we used vmem_altmap when hot-adding
> @@ -2079,33 +2127,13 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  			 */
>  			mhp_altmap.alloc = nr_vmemmap_pages;
>  			altmap = &mhp_altmap;
> +			__try_remove_memory(nid, cur_start, memblock_size,
> +						 altmap);
>  		}
> +	} else {
> +		__try_remove_memory(nid, start, size, NULL);
>  	}
>  
> -	/* remove memmap entry */
> -	firmware_map_remove(start, start + size, "System RAM");
> -
> -	/*
> -	 * Memory block device removal under the device_hotplug_lock is
> -	 * a barrier against racing online attempts.
> -	 */
> -	remove_memory_block_devices(start, size);
> -
> -	mem_hotplug_begin();
> -
> -	arch_remove_memory(start, size, altmap);
> -
> -	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> -		memblock_phys_free(start, size);
> -		memblock_remove(start, size);
> -	}
> -
> -	release_mem_region_adjustable(start, size);
> -
> -	if (nid != NUMA_NO_NODE)
> -		try_offline_node(nid);
> -
> -	mem_hotplug_done();
>  	return 0;
>  }
>  
>
> -- 
> 2.41.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-07-23 14:53   ` Aneesh Kumar K.V
@ 2023-07-24  3:16     ` Huang, Ying
  2023-08-02  6:02       ` Verma, Vishal L
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2023-07-24  3:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Vishal Verma, Andrew Morton, David Hildenbrand, Oscar Salvador,
	Dan Williams, Dave Jiang, linux-kernel, linux-mm, nvdimm,
	linux-cxl, Dave Hansen, Jonathan Cameron, Jeff Moyer

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Vishal Verma <vishal.l.verma@intel.com> writes:
>
>> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is currently
>> restricted to 'memblock_size' chunks of memory being added. Adding a
>> larger span of memory precludes memmap_on_memory semantics.
>>
>> For users of hotplug such as kmem, large amounts of memory might get
>> added from the CXL subsystem. In some cases, this amount may exceed the
>> available 'main memory' to store the memmap for the memory being added.
>> In this case, it is useful to have a way to place the memmap on the
>> memory being added, even if it means splitting the addition into
>> memblock-sized chunks.
>>
>> Change add_memory_resource() to loop over memblock-sized chunks of
>> memory if caller requested memmap_on_memory, and if other conditions for
>> it are met,. Teach try_remove_memory() to also expect that a memory
>> range being removed might have been split up into memblock sized chunks,
>> and to loop through those as needed.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>  mm/memory_hotplug.c | 154 +++++++++++++++++++++++++++++++---------------------
>>  1 file changed, 91 insertions(+), 63 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index e9bcacbcbae2..20456f0d28e6 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1286,6 +1286,35 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>>  }
>>  EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>>  
>> +static int add_memory_create_devices(int nid, struct memory_group *group,
>> +				     u64 start, u64 size, mhp_t mhp_flags)
>> +{
>> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>> +	struct vmem_altmap mhp_altmap = {};
>> +	int ret;
>> +
>> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
>> +		mhp_altmap.free = PHYS_PFN(size);
>> +		mhp_altmap.base_pfn = PHYS_PFN(start);
>> +		params.altmap = &mhp_altmap;
>> +	}
>> +
>> +	/* call arch's memory hotadd */
>> +	ret = arch_add_memory(nid, start, size, &params);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/* create memory block devices after memory was added */
>> +	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
>> +					  group);
>> +	if (ret) {
>> +		arch_remove_memory(start, size, NULL);
>> +		return ret;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  /*
>>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>>   * and online/offline operations (triggered e.g. by sysfs).
>> @@ -1294,11 +1323,10 @@ EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>>   */
>>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>  {
>> -	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>> +	unsigned long memblock_size = memory_block_size_bytes();
>>  	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>> -	struct vmem_altmap mhp_altmap = {};
>>  	struct memory_group *group = NULL;
>> -	u64 start, size;
>> +	u64 start, size, cur_start;
>>  	bool new_node = false;
>>  	int ret;
>>  
>> @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>  	/*
>>  	 * Self hosted memmap array
>>  	 */
>> -	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> -		if (!mhp_supports_memmap_on_memory(size)) {
>> -			ret = -EINVAL;
>> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
>> +	    mhp_supports_memmap_on_memory(memblock_size)) {
>> +		for (cur_start = start; cur_start < start + size;
>> +		     cur_start += memblock_size) {
>> +			ret = add_memory_create_devices(nid, group, cur_start,
>> +							memblock_size,
>> +							mhp_flags);
>> +			if (ret)
>> +				goto error;
>> +		}
>
> We should handle the below error details here. 
>
> 1) If we hit an error after some blocks got added, should we iterate over rest of the dev_dax->nr_range.
> 2) With some blocks added if we return a failure here, we remove the
> resource in dax_kmem. Is that ok? 
>
> IMHO error handling with partial creation of memory blocks in a resource range should be
> documented with this change.

Or, should we remove all added memory blocks upon error?

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-07-20  7:14 ` [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks Vishal Verma
  2023-07-21 12:20   ` Aneesh Kumar K.V
  2023-07-23 14:53   ` Aneesh Kumar K.V
@ 2023-07-24  5:54   ` Huang, Ying
  2023-08-02  6:08     ` Verma, Vishal L
  2 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2023-07-24  5:54 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Dan Williams,
	Dave Jiang, linux-kernel, linux-mm, nvdimm, linux-cxl,
	Dave Hansen, Aneesh Kumar K.V, Jonathan Cameron, Jeff Moyer

Vishal Verma <vishal.l.verma@intel.com> writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is currently
> restricted to 'memblock_size' chunks of memory being added. Adding a
> larger span of memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met,. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 154 +++++++++++++++++++++++++++++++---------------------
>  1 file changed, 91 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9bcacbcbae2..20456f0d28e6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1286,6 +1286,35 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  }
>  EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +				     u64 start, u64 size, mhp_t mhp_flags)
> +{
> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	struct vmem_altmap mhp_altmap = {};
> +	int ret;
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> +		mhp_altmap.free = PHYS_PFN(size);
> +		mhp_altmap.base_pfn = PHYS_PFN(start);
> +		params.altmap = &mhp_altmap;
> +	}
> +
> +	/* call arch's memory hotadd */
> +	ret = arch_add_memory(nid, start, size, &params);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* create memory block devices after memory was added */
> +	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
> +					  group);
> +	if (ret) {
> +		arch_remove_memory(start, size, NULL);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1294,11 +1323,10 @@ EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> -	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	unsigned long memblock_size = memory_block_size_bytes();
>  	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> -	struct vmem_altmap mhp_altmap = {};
>  	struct memory_group *group = NULL;
> -	u64 start, size;
> +	u64 start, size, cur_start;
>  	bool new_node = false;
>  	int ret;
>  
> @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	/*
>  	 * Self hosted memmap array
>  	 */
> -	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> -		if (!mhp_supports_memmap_on_memory(size)) {
> -			ret = -EINVAL;
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> +	    mhp_supports_memmap_on_memory(memblock_size)) {
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			ret = add_memory_create_devices(nid, group, cur_start,
> +							memblock_size,
> +							mhp_flags);
> +			if (ret)
> +				goto error;
> +		}
> +	} else {
> +		ret = add_memory_create_devices(nid, group, start, size, mhp_flags);
> +		if (ret)
>  			goto error;

Another choice to organize code is to use different step (memblock_size
vs. size) in "for" loop.

It's not necessary in this patchset.  It appears that we cannot create
1GB mapping if we put memmap on memory now, right?  If so, is it doable
to support that via separating creating memory mapping from
arch_add_memory()?

> -		}
> -		mhp_altmap.free = PHYS_PFN(size);
> -		mhp_altmap.base_pfn = PHYS_PFN(start);
> -		params.altmap = &mhp_altmap;
> -	}
> -
> -	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, &params);
> -	if (ret < 0)
> -		goto error;
> -
> -	/* create memory block devices after memory was added */
> -	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
> -					  group);
> -	if (ret) {
> -		arch_remove_memory(start, size, NULL);
> -		goto error;
>  	}
>  
>  	if (new_node) {
> @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>  
> -static int __ref try_remove_memory(u64 start, u64 size)
> +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
> +				     struct vmem_altmap *altmap)
>  {
> -	struct vmem_altmap mhp_altmap = {};
> -	struct vmem_altmap *altmap = NULL;
> -	unsigned long nr_vmemmap_pages;
> -	int rc = 0, nid = NUMA_NO_NODE;
> +	/* remove memmap entry */
> +	firmware_map_remove(start, start + size, "System RAM");

If mhp_supports_memmap_on_memory(), we will call
firmware_map_add_hotplug() for whole range.  But here we may call
firmware_map_remove() for part of range.  Is it OK?

--
Best Regards,
Huang, Ying

> +
> +	/*
> +	 * Memory block device removal under the device_hotplug_lock is
> +	 * a barrier against racing online attempts.
> +	 */
> +	remove_memory_block_devices(start, size);
> +
> +	mem_hotplug_begin();
> +
> +	arch_remove_memory(start, size, altmap);
> +
> +	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> +		memblock_phys_free(start, size);
> +		memblock_remove(start, size);
> +	}
> +
> +	release_mem_region_adjustable(start, size);
> +
> +	if (nid != NUMA_NO_NODE)
> +		try_offline_node(nid);
> +
> +	mem_hotplug_done();
> +}
> +
> +static int try_remove_memory(u64 start, u64 size)
> +{
> +	int rc, nid = NUMA_NO_NODE;
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -2058,20 +2105,21 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  		return rc;
>  
>  	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> +	 * For memmap_on_memory, the altmaps could have been added on
> +	 * a per-memblock basis. Loop through the entire range if so,
> +	 * and remove each memblock and its altmap
>  	 */
>  	if (mhp_memmap_on_memory()) {
> -		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> -						      get_nr_vmemmap_pages_cb);
> -		if (nr_vmemmap_pages) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> +		unsigned long memblock_size = memory_block_size_bytes();
> +		struct vmem_altmap mhp_altmap = {};
> +		struct vmem_altmap *altmap;
> +		u64 cur_start;
>  
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			unsigned long nr_vmemmap_pages =
> +				walk_memory_blocks(start, memblock_size, NULL,
> +						   get_nr_vmemmap_pages_cb);
>  			/*
>  			 * Let remove_pmd_table->free_hugepage_table do the
>  			 * right thing if we used vmem_altmap when hot-adding
> @@ -2079,33 +2127,13 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  			 */
>  			mhp_altmap.alloc = nr_vmemmap_pages;
>  			altmap = &mhp_altmap;
> +			__try_remove_memory(nid, cur_start, memblock_size,
> +						 altmap);
>  		}
> +	} else {
> +		__try_remove_memory(nid, start, size, NULL);
>  	}
>  
> -	/* remove memmap entry */
> -	firmware_map_remove(start, start + size, "System RAM");
> -
> -	/*
> -	 * Memory block device removal under the device_hotplug_lock is
> -	 * a barrier against racing online attempts.
> -	 */
> -	remove_memory_block_devices(start, size);
> -
> -	mem_hotplug_begin();
> -
> -	arch_remove_memory(start, size, altmap);
> -
> -	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> -		memblock_phys_free(start, size);
> -		memblock_remove(start, size);
> -	}
> -
> -	release_mem_region_adjustable(start, size);
> -
> -	if (nid != NUMA_NO_NODE)
> -		try_offline_node(nid);
> -
> -	mem_hotplug_done();
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
  2023-07-20  7:14 ` [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
@ 2023-07-24  6:00   ` Huang, Ying
  0 siblings, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2023-07-24  6:00 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Dan Williams,
	Dave Jiang, linux-kernel, linux-mm, nvdimm, linux-cxl,
	Dave Hansen, Aneesh Kumar K.V, Jonathan Cameron, Jeff Moyer

Vishal Verma <vishal.l.verma@intel.com> writes:

> In preparation for dax drivers, which can be built as modules,
> to use this interface, export it with EXPORT_SYMBOL_GPL(). Add a #else
> case for the symbol for builds without CONFIG_MEMORY_HOTPLUG.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  include/linux/memory_hotplug.h | 5 +++++
>  mm/memory_hotplug.c            | 1 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 013c69753c91..fc5da07ad011 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -355,6 +355,11 @@ extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
>  				      struct mhp_params *params);
>  void arch_remove_linear_mapping(u64 start, u64 size);
>  extern bool mhp_supports_memmap_on_memory(unsigned long size);
> +#else
> +static inline bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_MEMORY_HOTPLUG */

It appears that there is no user of mhp_supports_memmap_on_memory() that
may be compiled with !CONFIG_MEMORY_HOTPLUG?

--
Best Regards,
Huang, Ying

>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3f231cf1b410..e9bcacbcbae2 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1284,6 +1284,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
>  	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>  }
> +EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>  
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem
  2023-07-20  7:14 [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
                   ` (2 preceding siblings ...)
  2023-07-20  7:14 ` [PATCH v2 3/3] dax/kmem: allow kmem to add memory with memmap_on_memory Vishal Verma
@ 2023-07-25 15:54 ` David Hildenbrand
  3 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2023-07-25 15:54 UTC (permalink / raw)
  To: Vishal Verma, Andrew Morton, Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Aneesh Kumar K.V, Jonathan Cameron, Jeff Moyer

On 20.07.23 09:14, Vishal Verma wrote:
> The dax/kmem driver can potentially hot-add large amounts of memory
> originating from CXL memory expanders, or NVDIMMs, or other 'device
> memories'. There is a chance there isn't enough regular system memory
> available to fit the memmap for this new memory. It's therefore
> desirable, if all other conditions are met, for the kmem managed memory
> to place its memmap on the newly added memory itself.
> 
> The main hurdle for accomplishing this for kmem is that memmap_on_memory
> can only be done if the memory being added is equal to the size of one
> memblock. To overcome this,allow the hotplug code to split an add_memory()
> request into memblock-sized chunks, and try_remove_memory() to also
> expect and handle such a scenario.
> 
> Patch 1 exports mhp_supports_memmap_on_memory() so it can be used by the
> kmem driver.
> 
> Patch 2 teaches the memory_hotplug code to allow for splitting
> add_memory() and remove_memory() requests over memblock sized chunks.
> 
> Patch 3 adds a sysfs control for the kmem driver that would
> allow an opt-out of using memmap_on_memory for the memory being added.
> 

It might be reasonable to rebase this on Aneesh's work. For example, 
patch #1 might not be required anymore.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-07-24  3:16     ` Huang, Ying
@ 2023-08-02  6:02       ` Verma, Vishal L
  2023-08-14  6:04         ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Verma, Vishal L @ 2023-08-02  6:02 UTC (permalink / raw)
  To: Huang, Ying, aneesh.kumar
  Cc: david, Jiang, Dave, linux-mm, osalvador, akpm, linux-kernel,
	Williams, Dan J, dave.hansen, Jonathan.Cameron, nvdimm, jmoyer,
	linux-cxl

On Mon, 2023-07-24 at 11:16 +0800, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> > 
> > > @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid,
> > > struct resource *res, mhp_t mhp_flags)
> > >         /*
> > >          * Self hosted memmap array
> > >          */
> > > -       if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > > -               if (!mhp_supports_memmap_on_memory(size)) {
> > > -                       ret = -EINVAL;
> > > +       if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> > > +           mhp_supports_memmap_on_memory(memblock_size)) {
> > > +               for (cur_start = start; cur_start < start + size;
> > > +                    cur_start += memblock_size) {
> > > +                       ret = add_memory_create_devices(nid,
> > > group, cur_start,
> > > +                                                       memblock_
> > > size,
> > > +                                                       mhp_flags
> > > );
> > > +                       if (ret)
> > > +                               goto error;
> > > +               }
> > 
> > We should handle the below error details here. 
> > 
> > 1) If we hit an error after some blocks got added, should we
> > iterate over rest of the dev_dax->nr_range.
> > 2) With some blocks added if we return a failure here, we remove
> > the
> > resource in dax_kmem. Is that ok? 
> > 
> > IMHO error handling with partial creation of memory blocks in a
> > resource range should be
> > documented with this change.
> 
> Or, should we remove all added memory blocks upon error?
> 
I didn't address these in v3 - I wasn't sure how we'd proceed here.
Something obviously went very wrong and I'd imagine it is okay if this
memory is unusable as a result.

What woyuld removing the blocks we added look like? Just call
try_remove_memory() from the error path in add_memory_resource()? (for
a range of [start, cur_start) ?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-07-24  5:54   ` Huang, Ying
@ 2023-08-02  6:08     ` Verma, Vishal L
  2023-08-14  6:45       ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Verma, Vishal L @ 2023-08-02  6:08 UTC (permalink / raw)
  To: Huang, Ying
  Cc: david, Jiang, Dave, linux-mm, osalvador, akpm, linux-kernel,
	Williams, Dan J, dave.hansen, Jonathan.Cameron, nvdimm,
	aneesh.kumar, jmoyer, linux-cxl

On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote:
> Vishal Verma <vishal.l.verma@intel.com> writes:
> 
> > 
> > @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
> >  }
> >  EXPORT_SYMBOL(try_offline_node);
> >  
> > -static int __ref try_remove_memory(u64 start, u64 size)
> > +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
> > +                                    struct vmem_altmap *altmap)
> >  {
> > -       struct vmem_altmap mhp_altmap = {};
> > -       struct vmem_altmap *altmap = NULL;
> > -       unsigned long nr_vmemmap_pages;
> > -       int rc = 0, nid = NUMA_NO_NODE;
> > +       /* remove memmap entry */
> > +       firmware_map_remove(start, start + size, "System RAM");
> 
> If mhp_supports_memmap_on_memory(), we will call
> firmware_map_add_hotplug() for whole range.  But here we may call
> firmware_map_remove() for part of range.  Is it OK?
> 

Good point, this is a discrepancy in the add vs remove path. Can the
firmware memmap entries be moved up a bit in the add path, and is it
okay to create these for each memblock? Or should these be for the
whole range? I'm not familiar with the implications. (I've left it as
is for v3 for now, but depending on the direction I can update in a
future rev).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-08-02  6:02       ` Verma, Vishal L
@ 2023-08-14  6:04         ` Huang, Ying
  0 siblings, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2023-08-14  6:04 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: aneesh.kumar, david, Jiang, Dave, linux-mm, osalvador, akpm,
	linux-kernel, Williams, Dan J, dave.hansen, Jonathan.Cameron,
	nvdimm, jmoyer, linux-cxl

"Verma, Vishal L" <vishal.l.verma@intel.com> writes:

> On Mon, 2023-07-24 at 11:16 +0800, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> >
>> > > @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid,
>> > > struct resource *res, mhp_t mhp_flags)
>> > >         /*
>> > >          * Self hosted memmap array
>> > >          */
>> > > -       if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> > > -               if (!mhp_supports_memmap_on_memory(size)) {
>> > > -                       ret = -EINVAL;
>> > > +       if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
>> > > +           mhp_supports_memmap_on_memory(memblock_size)) {
>> > > +               for (cur_start = start; cur_start < start + size;
>> > > +                    cur_start += memblock_size) {
>> > > +                       ret = add_memory_create_devices(nid,
>> > > group, cur_start,
>> > > +                                                       memblock_
>> > > size,
>> > > +                                                       mhp_flags
>> > > );
>> > > +                       if (ret)
>> > > +                               goto error;
>> > > +               }
>> >
>> > We should handle the below error details here.
>> >
>> > 1) If we hit an error after some blocks got added, should we
>> > iterate over rest of the dev_dax->nr_range.
>> > 2) With some blocks added if we return a failure here, we remove
>> > the
>> > resource in dax_kmem. Is that ok?
>> >
>> > IMHO error handling with partial creation of memory blocks in a
>> > resource range should be
>> > documented with this change.
>>
>> Or, should we remove all added memory blocks upon error?
>>
> I didn't address these in v3 - I wasn't sure how we'd proceed here.
> Something obviously went very wrong and I'd imagine it is okay if this
> memory is unusable as a result.
>
> What woyuld removing the blocks we added look like? Just call
> try_remove_memory() from the error path in add_memory_resource()? (for
> a range of [start, cur_start) ?

I guess that we can just keep the original behavior.  Originally, if
something goes wrong, arch_remove_memory() and remove_memory_block() (in
create_memory_block_devices()) will be called for all added memory
range.  So, we should do that too?

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-08-02  6:08     ` Verma, Vishal L
@ 2023-08-14  6:45       ` Huang, Ying
  2023-08-14  7:20         ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2023-08-14  6:45 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: david, Jiang, Dave, linux-mm, osalvador, akpm, linux-kernel,
	Williams, Dan J, dave.hansen, Jonathan.Cameron, nvdimm,
	aneesh.kumar, jmoyer, linux-cxl, Greg Kroah-Hartman,
	Mike Rapoport, Bernhard Walle

"Verma, Vishal L" <vishal.l.verma@intel.com> writes:

> On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote:
>> Vishal Verma <vishal.l.verma@intel.com> writes:
>>
>> >
>> > @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
>> >  }
>> >  EXPORT_SYMBOL(try_offline_node);
>> >
>> > -static int __ref try_remove_memory(u64 start, u64 size)
>> > +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
>> > +                                    struct vmem_altmap *altmap)
>> >  {
>> > -       struct vmem_altmap mhp_altmap = {};
>> > -       struct vmem_altmap *altmap = NULL;
>> > -       unsigned long nr_vmemmap_pages;
>> > -       int rc = 0, nid = NUMA_NO_NODE;
>> > +       /* remove memmap entry */
>> > +       firmware_map_remove(start, start + size, "System RAM");
>>
>> If mhp_supports_memmap_on_memory(), we will call
>> firmware_map_add_hotplug() for whole range.  But here we may call
>> firmware_map_remove() for part of range.  Is it OK?
>>
>
> Good point, this is a discrepancy in the add vs remove path. Can the
> firmware memmap entries be moved up a bit in the add path, and is it
> okay to create these for each memblock? Or should these be for the
> whole range? I'm not familiar with the implications. (I've left it as
> is for v3 for now, but depending on the direction I can update in a
> future rev).

Cced more firmware map developers and maintainers.

Per my understanding, we should create one firmware memmap entry for
each memblock.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
  2023-08-14  6:45       ` Huang, Ying
@ 2023-08-14  7:20         ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2023-08-14  7:20 UTC (permalink / raw)
  To: Huang, Ying, Verma, Vishal L
  Cc: Jiang, Dave, linux-mm, osalvador, akpm, linux-kernel, Williams,
	Dan J, dave.hansen, Jonathan.Cameron, nvdimm, aneesh.kumar,
	jmoyer, linux-cxl, Greg Kroah-Hartman, Mike Rapoport,
	Bernhard Walle

On 14.08.23 08:45, Huang, Ying wrote:
> "Verma, Vishal L" <vishal.l.verma@intel.com> writes:
> 
>> On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote:
>>> Vishal Verma <vishal.l.verma@intel.com> writes:
>>>
>>>>
>>>> @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
>>>>   }
>>>>   EXPORT_SYMBOL(try_offline_node);
>>>>
>>>> -static int __ref try_remove_memory(u64 start, u64 size)
>>>> +static void __ref __try_remove_memory(int nid, u64 start, u64 size,
>>>> +                                    struct vmem_altmap *altmap)
>>>>   {
>>>> -       struct vmem_altmap mhp_altmap = {};
>>>> -       struct vmem_altmap *altmap = NULL;
>>>> -       unsigned long nr_vmemmap_pages;
>>>> -       int rc = 0, nid = NUMA_NO_NODE;
>>>> +       /* remove memmap entry */
>>>> +       firmware_map_remove(start, start + size, "System RAM");
>>>
>>> If mhp_supports_memmap_on_memory(), we will call
>>> firmware_map_add_hotplug() for whole range.  But here we may call
>>> firmware_map_remove() for part of range.  Is it OK?
>>>
>>
>> Good point, this is a discrepancy in the add vs remove path. Can the
>> firmware memmap entries be moved up a bit in the add path, and is it
>> okay to create these for each memblock? Or should these be for the
>> whole range? I'm not familiar with the implications. (I've left it as
>> is for v3 for now, but depending on the direction I can update in a
>> future rev).
> 
> Cced more firmware map developers and maintainers.
> 
> Per my understanding, we should create one firmware memmap entry for
> each memblock.

Ideally we should create it for the whole range, ti limit the ranges. 
But it really only matters for DIMMs; for dax/kmem, we'll not create any 
firmware entries.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-08-14  7:20 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20  7:14 [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
2023-07-20  7:14 ` [PATCH v2 1/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
2023-07-24  6:00   ` Huang, Ying
2023-07-20  7:14 ` [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks Vishal Verma
2023-07-21 12:20   ` Aneesh Kumar K.V
2023-07-23 14:53   ` Aneesh Kumar K.V
2023-07-24  3:16     ` Huang, Ying
2023-08-02  6:02       ` Verma, Vishal L
2023-08-14  6:04         ` Huang, Ying
2023-07-24  5:54   ` Huang, Ying
2023-08-02  6:08     ` Verma, Vishal L
2023-08-14  6:45       ` Huang, Ying
2023-08-14  7:20         ` David Hildenbrand
2023-07-20  7:14 ` [PATCH v2 3/3] dax/kmem: allow kmem to add memory with memmap_on_memory Vishal Verma
2023-07-25 15:54 ` [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).