* [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory
@ 2021-06-23 11:58 Hui Zhu
  2021-06-23 15:04 ` David Hildenbrand
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Hui Zhu @ 2021-06-23 11:58 UTC (permalink / raw)
  To: david, mst, jasowang, virtualization, linux-kernel; +Cc: Hui Zhu

From: Hui Zhu <teawater@antfin.com>

We did some virtio-mem resize tests in a high-memory-pressure environment.
Memory increases slowly and sometimes fails in these tests.
Here is a way to reproduce the issue:
Start a QEMU guest with a small amount of memory (132 MiB) and resize the
virtio-mem device to hotplug memory.
The following error is then reported:
[    8.097461] virtio_mem virtio0: requested size: 0x10000000
[    8.098038] virtio_mem virtio0: plugging memory: 0x100000000 -
0x107ffffff
[    8.098829] virtio_mem virtio0: adding memory: 0x100000000 -
0x107ffffff
[    8.106298] kworker/0:1: vmemmap alloc failure: order:9,
mode:0x4cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL),
nodemask=(null),cpuset=/,mems_allowed=0
[    8.107609] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 5.13.0-rc7+
[    8.108295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[    8.109476] Workqueue: events_freezable virtio_mem_run_wq
[    8.110039] Call Trace:
[    8.110305]  dump_stack+0x76/0x94
[    8.110654]  warn_alloc.cold+0x7b/0xdf
[    8.111054]  ? __alloc_pages+0x2c2/0x310
[    8.111462]  vmemmap_alloc_block+0x86/0xdc
[    8.111891]  vmemmap_populate+0xfc/0x325
[    8.112309]  __populate_section_memmap+0x38/0x4e
[    8.112787]  sparse_add_section+0x167/0x244
[    8.113226]  __add_pages+0xa6/0x130
[    8.113592]  add_pages+0x12/0x60
[    8.113934]  add_memory_resource+0x114/0x2d0
[    8.114377]  add_memory_driver_managed+0x7c/0xc0
[    8.114852]  virtio_mem_add_memory+0x57/0xe0
[    8.115304]  virtio_mem_sbm_plug_and_add_mb+0x9a/0x130
[    8.115833]  virtio_mem_run_wq+0x9d5/0x1100
I think allocating 2 MiB of contiguous memory can be slow and may fail
in some cases, especially in a high-memory-pressure environment.
This commit tries to add support for memory_hotplug.memmap_on_memory to
handle this issue.

Only SBM mode supports it for now because memory_hotplug.memmap_on_memory
requires a single memory block.
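
Note that the optimization only takes effect when the guest kernel has
the existing memory_hotplug.memmap_on_memory tunable enabled (an
assumption about the usual setup, not something this patch changes),
e.g. on the kernel command line:

    memory_hotplug.memmap_on_memory=1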

Add nr_vmemmap_pages and sbs_vmemmap to struct sbm.
If memory_hotplug.memmap_on_memory is enabled, the number of pages used
for a memory block's internal metadata is stored in nr_vmemmap_pages.
sbs_vmemmap is the number of vmemmap subblocks per Linux memory block.
The pages in the vmemmap subblocks may exceed nr_vmemmap_pages because
sb_size needs to span at least MAX_ORDER_NR_PAGES and
pageblock_nr_pages pages (virtio_mem_init).
None of the pages in the vmemmap subblocks are added to the buddy, not
even the pages that are not used to store the internal metadata
(struct pages), because they would not work reliably with
alloc_contig_range().
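
As a rough worked example (assuming x86-64 defaults of a 4 KiB page
size, a 64-byte struct page and 128 MiB memory blocks; the numbers are
illustrative only):

    nr_vmemmap_pages = PFN_DOWN(PFN_DOWN(128 MiB) * sizeof(struct page))
                     = PFN_DOWN(32768 * 64)
                     = 512 pages (2 MiB of memmap)
    sbs_vmemmap      = ALIGN(PFN_PHYS(512), sb_size) / sb_size
                     = ALIGN(2 MiB, sb_size) / sb_size
                     = 1 for any sb_size >= 2 MiB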

When resizing virtio-mem, sbs_vmemmap is accounted for in
virtio_mem_sbm_plug_and_add_mb, virtio_mem_sbm_unplug_any_sb_offline
and virtio_mem_sbm_unplug_any_sb_online, because the internal metadata
also needs real pages on the host to back it.  I think keeping the
virtio-mem size equal to the actual memory footprint on the host is
better if we want to set up a memory cgroup for QEMU.

I did not add a dedicated module_param for this feature and did not
guard the code with CONFIG_MHP_MEMMAP_ON_MEMORY.
Do I need to add them?

Thanks to David Hildenbrand for his help with this RFC.

Signed-off-by: Hui Zhu <teawater@antfin.com>
---
 drivers/virtio/virtio_mem.c | 210 ++++++++++++++++++++++++++++++------
 1 file changed, 180 insertions(+), 30 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 10ec60d81e84..5716ea656cd8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -189,6 +189,38 @@ struct virtio_mem {
 			 * in one 4 KiB page.
 			 */
 			unsigned long *sb_states;
+
+			/*
+			 * If nr_vmemmap_pages is not 0, the current
+			 * virtio-mem device allocates its internal metadata
+			 * (struct pages) from the hotadded memory.
+			 * See memory_hotplug.memmap_on_memory for more
+			 * detailed info.
+			 *
+			 * nr_vmemmap_pages is the number of vmemmap pages
+			 * of the current device.
+			 */
+			unsigned long nr_vmemmap_pages;
+
+			/*
+			 * The number of vmemmap subblocks per Linux memory
+			 * block.
+			 *
+			 * This is also the first sb_id of the Linux memory
+			 * block that can be added to the buddy.
+			 *
+			 * The pages in the vmemmap subblocks may exceed
+			 * nr_vmemmap_pages because sb_size needs to span
+			 * at least MAX_ORDER_NR_PAGES and
+			 * pageblock_nr_pages pages (virtio_mem_init).
+			 *
+			 * None of the pages in the vmemmap subblocks are
+			 * added to the buddy, not even the pages that are
+			 * not used to store the internal metadata (struct
+			 * pages), because they would not work reliably
+			 * with alloc_contig_range().
+			 */
+			uint32_t sbs_vmemmap;
 		} sbm;
 
 		struct {
@@ -545,10 +577,13 @@ static bool virtio_mem_sbm_test_sb_unplugged(struct virtio_mem *vm,
 static int virtio_mem_sbm_first_unplugged_sb(struct virtio_mem *vm,
 					    unsigned long mb_id)
 {
-	const int bit = virtio_mem_sbm_sb_state_bit_nr(vm, mb_id, 0);
+	const int bit =
+		virtio_mem_sbm_sb_state_bit_nr(vm, mb_id,
+					       vm->sbm.sbs_vmemmap);
 
 	return find_next_zero_bit(vm->sbm.sb_states,
-				  bit + vm->sbm.sbs_per_mb, bit) - bit;
+				  bit + vm->sbm.sbs_per_mb, bit) - bit +
+	       vm->sbm.sbs_vmemmap;
 }
 
 /*
@@ -603,9 +638,10 @@ static bool virtio_mem_could_add_memory(struct virtio_mem *vm, uint64_t size)
  * Will not modify the state of memory blocks in virtio-mem.
  */
 static int virtio_mem_add_memory(struct virtio_mem *vm, uint64_t addr,
-				 uint64_t size)
+				 uint64_t size, bool memmap_on_memory)
 {
 	int rc;
+	mhp_t mhp_flags = MHP_MERGE_RESOURCE;
 
 	/*
 	 * When force-unloading the driver and we still have memory added to
@@ -622,8 +658,10 @@ static int virtio_mem_add_memory(struct virtio_mem *vm, uint64_t addr,
 		addr + size - 1);
 	/* Memory might get onlined immediately. */
 	atomic64_add(size, &vm->offline_size);
+	if (memmap_on_memory)
+		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 	rc = add_memory_driver_managed(vm->nid, addr, size, vm->resource_name,
-				       MHP_MERGE_RESOURCE);
+				       mhp_flags);
 	if (rc) {
 		atomic64_sub(size, &vm->offline_size);
 		dev_warn(&vm->vdev->dev, "adding memory failed: %d\n", rc);
@@ -643,7 +681,8 @@ static int virtio_mem_sbm_add_mb(struct virtio_mem *vm, unsigned long mb_id)
 	const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
 	const uint64_t size = memory_block_size_bytes();
 
-	return virtio_mem_add_memory(vm, addr, size);
+	return virtio_mem_add_memory(vm, addr, size,
+				     vm->sbm.sbs_vmemmap > 0);
 }
 
 /*
@@ -654,7 +693,7 @@ static int virtio_mem_bbm_add_bb(struct virtio_mem *vm, unsigned long bb_id)
 	const uint64_t addr = virtio_mem_bb_id_to_phys(vm, bb_id);
 	const uint64_t size = vm->bbm.bb_size;
 
-	return virtio_mem_add_memory(vm, addr, size);
+	return virtio_mem_add_memory(vm, addr, size, false);
 }
 
 /*
@@ -871,7 +910,23 @@ static void virtio_mem_sbm_notify_going_offline(struct virtio_mem *vm,
 	unsigned long pfn;
 	int sb_id;
 
-	for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
+	if (vm->sbm.sbs_vmemmap) {
+		unsigned long other_pages
+				= vm->sbm.sbs_vmemmap * nr_pages -
+				  vm->sbm.nr_vmemmap_pages;
+
+		if (other_pages) {
+			unsigned long other_pfn
+				= PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id)) +
+				  vm->sbm.nr_vmemmap_pages;
+
+			virtio_mem_fake_offline_going_offline(other_pfn,
+							      other_pages);
+		}
+	}
+
+	for (sb_id = vm->sbm.sbs_vmemmap;
+	     sb_id < vm->sbm.sbs_per_mb; sb_id++) {
 		if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
 			continue;
 		pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
@@ -887,7 +942,23 @@ static void virtio_mem_sbm_notify_cancel_offline(struct virtio_mem *vm,
 	unsigned long pfn;
 	int sb_id;
 
-	for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
+	if (vm->sbm.sbs_vmemmap) {
+		unsigned long other_pages
+				= vm->sbm.sbs_vmemmap * nr_pages -
+				  vm->sbm.nr_vmemmap_pages;
+
+		if (other_pages) {
+			unsigned long other_pfn
+				= PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id)) +
+				  vm->sbm.nr_vmemmap_pages;
+
+			virtio_mem_fake_offline_cancel_offline(other_pfn,
+							       other_pages);
+		}
+	}
+
+	for (sb_id = vm->sbm.sbs_vmemmap;
+	     sb_id < vm->sbm.sbs_per_mb; sb_id++) {
 		if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
 			continue;
 		pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
@@ -933,8 +1004,8 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 	struct virtio_mem *vm = container_of(nb, struct virtio_mem,
 					     memory_notifier);
 	struct memory_notify *mhp = arg;
-	const unsigned long start = PFN_PHYS(mhp->start_pfn);
-	const unsigned long size = PFN_PHYS(mhp->nr_pages);
+	unsigned long start = PFN_PHYS(mhp->start_pfn);
+	unsigned long size = PFN_PHYS(mhp->nr_pages);
 	int rc = NOTIFY_OK;
 	unsigned long id;
 
@@ -942,6 +1013,13 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 		return NOTIFY_DONE;
 
 	if (vm->in_sbm) {
+		if (vm->sbm.nr_vmemmap_pages) {
+			unsigned long vmemmap_size = vm->sbm.nr_vmemmap_pages;
+
+			vmemmap_size <<= PAGE_SHIFT;
+			start -= vmemmap_size;
+			size += vmemmap_size;
+		}
 		id = virtio_mem_phys_to_mb_id(start);
 		/*
 		 * In SBM, we add memory in separate memory blocks - we expect
@@ -1227,7 +1305,10 @@ static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
 			 */
 			id = virtio_mem_phys_to_mb_id(addr);
 			sb_id = virtio_mem_phys_to_sb_id(vm, addr);
-			do_online = virtio_mem_sbm_test_sb_plugged(vm, id,
+			if (sb_id < vm->sbm.sbs_vmemmap)
+				do_online = false;
+			else
+				do_online = virtio_mem_sbm_test_sb_plugged(vm, id,
 								   sb_id, 1);
 		} else {
 			/*
@@ -1462,14 +1543,14 @@ static int virtio_mem_sbm_unplug_any_sb(struct virtio_mem *vm,
 	sb_id = vm->sbm.sbs_per_mb - 1;
 	while (*nb_sb) {
 		/* Find the next candidate subblock */
-		while (sb_id >= 0 &&
+		while (sb_id >= vm->sbm.sbs_vmemmap &&
 		       virtio_mem_sbm_test_sb_unplugged(vm, mb_id, sb_id, 1))
 			sb_id--;
-		if (sb_id < 0)
+		if (sb_id < vm->sbm.sbs_vmemmap)
 			break;
 		/* Try to unplug multiple subblocks at a time */
 		count = 1;
-		while (count < *nb_sb && sb_id > 0 &&
+		while (count < *nb_sb && sb_id > vm->sbm.sbs_vmemmap &&
 		       virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id - 1, 1)) {
 			count++;
 			sb_id--;
@@ -1494,7 +1575,7 @@ static int virtio_mem_sbm_unplug_any_sb(struct virtio_mem *vm,
  */
 static int virtio_mem_sbm_unplug_mb(struct virtio_mem *vm, unsigned long mb_id)
 {
-	uint64_t nb_sb = vm->sbm.sbs_per_mb;
+	uint64_t nb_sb = vm->sbm.sbs_per_mb - vm->sbm.sbs_vmemmap;
 
 	return virtio_mem_sbm_unplug_any_sb(vm, mb_id, &nb_sb);
 }
@@ -1534,12 +1615,15 @@ static int virtio_mem_sbm_prepare_next_mb(struct virtio_mem *vm,
 static int virtio_mem_sbm_plug_and_add_mb(struct virtio_mem *vm,
 					  unsigned long mb_id, uint64_t *nb_sb)
 {
-	const int count = min_t(int, *nb_sb, vm->sbm.sbs_per_mb);
+	int count = min_t(int, *nb_sb, vm->sbm.sbs_per_mb);
 	int rc;
 
 	if (WARN_ON_ONCE(!count))
 		return -EINVAL;
 
+	if (vm->sbm.sbs_vmemmap)
+		count = max_t(int, count, vm->sbm.sbs_vmemmap);
+
 	/*
 	 * Plug the requested number of subblocks before adding it to linux,
 	 * so that onlining will directly online all plugged subblocks.
@@ -1570,7 +1654,10 @@ static int virtio_mem_sbm_plug_and_add_mb(struct virtio_mem *vm,
 		return rc;
 	}
 
-	*nb_sb -= count;
+	if (*nb_sb >= count)
+		*nb_sb -= count;
+	else
+		*nb_sb = 0;
 	return 0;
 }
 
@@ -1617,7 +1704,10 @@ static int virtio_mem_sbm_plug_any_sb(struct virtio_mem *vm,
 		virtio_mem_fake_online(pfn, nr_pages);
 	}
 
-	if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, 0, vm->sbm.sbs_per_mb)) {
+	if (virtio_mem_sbm_test_sb_plugged(vm, mb_id,
+					   vm->sbm.sbs_vmemmap,
+					   vm->sbm.sbs_per_mb -
+					   vm->sbm.sbs_vmemmap)) {
 		if (online)
 			virtio_mem_sbm_set_mb_state(vm, mb_id,
 						    VIRTIO_MEM_SBM_MB_ONLINE);
@@ -1820,13 +1910,17 @@ static int virtio_mem_sbm_unplug_any_sb_offline(struct virtio_mem *vm,
 	rc = virtio_mem_sbm_unplug_any_sb(vm, mb_id, nb_sb);
 
 	/* some subblocks might have been unplugged even on failure */
-	if (!virtio_mem_sbm_test_sb_plugged(vm, mb_id, 0, vm->sbm.sbs_per_mb))
+	if (!virtio_mem_sbm_test_sb_plugged(vm, mb_id, vm->sbm.sbs_vmemmap,
+					    vm->sbm.sbs_per_mb -
+					    vm->sbm.sbs_vmemmap))
 		virtio_mem_sbm_set_mb_state(vm, mb_id,
 					    VIRTIO_MEM_SBM_MB_OFFLINE_PARTIAL);
 	if (rc)
 		return rc;
 
-	if (virtio_mem_sbm_test_sb_unplugged(vm, mb_id, 0, vm->sbm.sbs_per_mb)) {
+	if (virtio_mem_sbm_test_sb_unplugged(vm, mb_id, vm->sbm.sbs_vmemmap,
+					     vm->sbm.sbs_per_mb -
+					     vm->sbm.sbs_vmemmap)) {
 		/*
 		 * Remove the block from Linux - this should never fail.
 		 * Hinder the block from getting onlined by marking it
@@ -1840,6 +1934,23 @@ static int virtio_mem_sbm_unplug_any_sb_offline(struct virtio_mem *vm,
 		rc = virtio_mem_sbm_remove_mb(vm, mb_id);
 		BUG_ON(rc);
 		mutex_lock(&vm->hotplug_mutex);
+
+		/* Remove vmemmap pages. */
+		if (vm->sbm.sbs_vmemmap) {
+			rc = virtio_mem_sbm_unplug_sb(vm, mb_id, 0,
+						      vm->sbm.sbs_vmemmap);
+			/*
+			 * Just warn because this error will
+			 * not affect next plug.
+			 */
+			WARN_ON(rc);
+			if (!rc) {
+				if (*nb_sb >= vm->sbm.sbs_vmemmap)
+					*nb_sb -= vm->sbm.sbs_vmemmap;
+				else
+					*nb_sb = 0;
+			}
+		}
 	}
 	return 0;
 }
@@ -1894,19 +2005,24 @@ static int virtio_mem_sbm_unplug_any_sb_online(struct virtio_mem *vm,
 	int rc, sb_id;
 
 	/* If possible, try to unplug the complete block in one shot. */
-	if (*nb_sb >= vm->sbm.sbs_per_mb &&
-	    virtio_mem_sbm_test_sb_plugged(vm, mb_id, 0, vm->sbm.sbs_per_mb)) {
-		rc = virtio_mem_sbm_unplug_sb_online(vm, mb_id, 0,
-						     vm->sbm.sbs_per_mb);
+	if (*nb_sb >= (vm->sbm.sbs_per_mb - vm->sbm.sbs_vmemmap) &&
+	    virtio_mem_sbm_test_sb_plugged(vm, mb_id, vm->sbm.sbs_vmemmap,
+					   vm->sbm.sbs_per_mb -
+					   vm->sbm.sbs_vmemmap)) {
+		rc = virtio_mem_sbm_unplug_sb_online(vm, mb_id,
+						     vm->sbm.sbs_vmemmap,
+						     vm->sbm.sbs_per_mb -
+						     vm->sbm.sbs_vmemmap);
 		if (!rc) {
-			*nb_sb -= vm->sbm.sbs_per_mb;
+			*nb_sb -= (vm->sbm.sbs_per_mb - vm->sbm.sbs_vmemmap);
 			goto unplugged;
 		} else if (rc != -EBUSY)
 			return rc;
 	}
 
 	/* Fallback to single subblocks. */
-	for (sb_id = vm->sbm.sbs_per_mb - 1; sb_id >= 0 && *nb_sb; sb_id--) {
+	for (sb_id = vm->sbm.sbs_per_mb - 1;
+	     sb_id >= vm->sbm.sbs_vmemmap && *nb_sb; sb_id--) {
 		/* Find the next candidate subblock */
 		while (sb_id >= 0 &&
 		       !virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
@@ -1928,13 +2044,31 @@ static int virtio_mem_sbm_unplug_any_sb_online(struct virtio_mem *vm,
 	 * remove it. This will usually not fail, as no memory is in use
 	 * anymore - however some other notifiers might NACK the request.
 	 */
-	if (virtio_mem_sbm_test_sb_unplugged(vm, mb_id, 0, vm->sbm.sbs_per_mb)) {
+	if (virtio_mem_sbm_test_sb_unplugged(vm, mb_id, vm->sbm.sbs_vmemmap,
+				vm->sbm.sbs_per_mb - vm->sbm.sbs_vmemmap)) {
 		mutex_unlock(&vm->hotplug_mutex);
 		rc = virtio_mem_sbm_offline_and_remove_mb(vm, mb_id);
 		mutex_lock(&vm->hotplug_mutex);
-		if (!rc)
+		if (!rc) {
 			virtio_mem_sbm_set_mb_state(vm, mb_id,
 						    VIRTIO_MEM_SBM_MB_UNUSED);
+			/* Remove vmemmap pages. */
+			if (vm->sbm.sbs_vmemmap) {
+				rc = virtio_mem_sbm_unplug_sb(vm, mb_id, 0,
+							vm->sbm.sbs_vmemmap);
+				/*
+				 * Just warn because this error will
+				 * not affect next plug.
+				 */
+				WARN_ON(rc);
+				if (!rc) {
+					if (*nb_sb >= vm->sbm.sbs_vmemmap)
+						*nb_sb -= vm->sbm.sbs_vmemmap;
+					else
+						*nb_sb = 0;
+				}
+			}
+		}
 	}
 
 	return 0;
@@ -2444,6 +2578,15 @@ static int virtio_mem_init(struct virtio_mem *vm)
 		       memory_block_size_bytes() - 1;
 		vm->sbm.first_mb_id = virtio_mem_phys_to_mb_id(addr);
 		vm->sbm.next_mb_id = vm->sbm.first_mb_id;
+		if (mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
+			vm->sbm.nr_vmemmap_pages
+				= PFN_DOWN(PFN_DOWN(memory_block_size_bytes()) *
+					   sizeof(struct page));
+			vm->sbm.sbs_vmemmap
+				= ALIGN(PFN_PHYS(vm->sbm.nr_vmemmap_pages),
+					vm->sbm.sb_size) /
+				  vm->sbm.sb_size;
+		}
 	} else {
 		/* BBM: At least one Linux memory block. */
 		vm->bbm.bb_size = max_t(uint64_t, vm->device_block_size,
@@ -2481,10 +2624,17 @@ static int virtio_mem_init(struct virtio_mem *vm)
 		 (unsigned long long)vm->device_block_size);
 	dev_info(&vm->vdev->dev, "memory block size: 0x%lx",
 		 memory_block_size_bytes());
-	if (vm->in_sbm)
+	if (vm->in_sbm) {
 		dev_info(&vm->vdev->dev, "subblock size: 0x%llx",
 			 (unsigned long long)vm->sbm.sb_size);
-	else
+		if (vm->sbm.sbs_vmemmap) {
+			dev_info(&vm->vdev->dev, "nr vmemmap pages: %lu",
+				 vm->sbm.nr_vmemmap_pages);
+			dev_info(&vm->vdev->dev,
+				 "vmemmap subblocks per memory block: %u",
+				 vm->sbm.sbs_vmemmap);
+		}
+	} else
 		dev_info(&vm->vdev->dev, "big block size: 0x%llx",
 			 (unsigned long long)vm->bbm.bb_size);
 	if (vm->nid != NUMA_NO_NODE && IS_ENABLED(CONFIG_NUMA))
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory
  2021-06-23 11:58 [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory Hui Zhu
@ 2021-06-23 15:04 ` David Hildenbrand
  2021-06-23 18:59 ` kernel test robot
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand @ 2021-06-23 15:04 UTC (permalink / raw)
  To: Hui Zhu; +Cc: Hui Zhu, virtualization, linux-kernel, mst



On Wednesday, June 23, 2021, Hui Zhu <teawater@gmail.com> wrote:

> From: Hui Zhu <teawater@antfin.com>
>
> We did some virtio-mem resize tests in a high-memory-pressure environment.
> Memory increases slowly and sometimes fails in these tests.
> Here is a way to reproduce the issue:
> Start a QEMU guest with a small amount of memory (132 MiB) and resize the
> virtio-mem device to hotplug memory.
> The following error is then reported:
> [    8.097461] virtio_mem virtio0: requested size: 0x10000000
> [    8.098038] virtio_mem virtio0: plugging memory: 0x100000000 -
> 0x107ffffff
> [    8.098829] virtio_mem virtio0: adding memory: 0x100000000 -
> 0x107ffffff
> [    8.106298] kworker/0:1: vmemmap alloc failure: order:9,
> mode:0x4cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL),
> nodemask=(null),cpuset=/,mems_allowed=0
> [    8.107609] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 5.13.0-rc7+
> [    8.108295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [    8.109476] Workqueue: events_freezable virtio_mem_run_wq
> [    8.110039] Call Trace:
> [    8.110305]  dump_stack+0x76/0x94
> [    8.110654]  warn_alloc.cold+0x7b/0xdf
> [    8.111054]  ? __alloc_pages+0x2c2/0x310
> [    8.111462]  vmemmap_alloc_block+0x86/0xdc
> [    8.111891]  vmemmap_populate+0xfc/0x325
> [    8.112309]  __populate_section_memmap+0x38/0x4e
> [    8.112787]  sparse_add_section+0x167/0x244
> [    8.113226]  __add_pages+0xa6/0x130
> [    8.113592]  add_pages+0x12/0x60
> [    8.113934]  add_memory_resource+0x114/0x2d0
> [    8.114377]  add_memory_driver_managed+0x7c/0xc0
> [    8.114852]  virtio_mem_add_memory+0x57/0xe0
> [    8.115304]  virtio_mem_sbm_plug_and_add_mb+0x9a/0x130
> [    8.115833]  virtio_mem_run_wq+0x9d5/0x1100
> I think allocating 2 MiB of contiguous memory can be slow and may fail
> in some cases, especially in a high-memory-pressure environment.
> This commit tries to add support for memory_hotplug.memmap_on_memory to
> handle this issue.
>
> Only SBM mode supports it for now because memory_hotplug.memmap_on_memory
> requires a single memory block.



Hi,

I'm on vacation this and next week. I'll have a closer look when I'm back.

We also want to have this optimization for BBM, initially when a big block
comprises a single memory block. But we can add that separately later.


>
> Add nr_vmemmap_pages and sbs_vmemmap to struct sbm.
> If memory_hotplug.memmap_on_memory is enabled, the number of pages used
> for a memory block's internal metadata is stored in nr_vmemmap_pages.
> sbs_vmemmap is the number of vmemmap subblocks per Linux memory block.
> The pages in the vmemmap subblocks may exceed nr_vmemmap_pages because
> sb_size needs to span at least MAX_ORDER_NR_PAGES and
> pageblock_nr_pages pages (virtio_mem_init).
> None of the pages in the vmemmap subblocks are added to the buddy, not
> even the pages that are not used to store the internal metadata
> (struct pages), because they would not work reliably with
> alloc_contig_range().



We most certainly want to handle subblocks partially consumed by metadata
and expose that memory to the buddy. alloc_contig_range() will really only
be sub-optimal on ZONE_NORMAL right now when called on pageblock
granularity; so that’s when we can expect memory unplug to be less
reliable, which is the case either way. ZONE_MOVABLE should be just fine, I
think.


>
> When resizing virtio-mem, sbs_vmemmap is accounted for in
> virtio_mem_sbm_plug_and_add_mb, virtio_mem_sbm_unplug_any_sb_offline
> and virtio_mem_sbm_unplug_any_sb_online, because the internal metadata
> also needs real pages on the host to back it.  I think keeping the
> virtio-mem size equal to the actual memory footprint on the host is
> better if we want to set up a memory cgroup for QEMU.
>
> I did not add a dedicated module_param for this feature and did not
> guard the code with CONFIG_MHP_MEMMAP_ON_MEMORY.
> Do I need to add them?


There is a single tunable to enable memmap_on_memory, so that should be
sufficient I think.

Thanks!


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory
  2021-06-23 11:58 [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory Hui Zhu
  2021-06-23 15:04 ` David Hildenbrand
@ 2021-06-23 18:59 ` kernel test robot
  2021-06-25  3:45 ` kernel test robot
  2021-07-07 15:17   ` David Hildenbrand
  3 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2021-06-23 18:59 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1948 bytes --]

Hi Hui,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[also build test ERROR on v5.13-rc7 next-20210623]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Hui-Zhu/virtio-mem-Add-support-of-memory_hotplug-memmap_on_memory/20210623-200242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 0c18f29aae7ce3dadd26d8ee3505d07cc982df75
config: x86_64-randconfig-a001-20210622 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project b259740801d3515810ecc15bf0c24b0d476a1608)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/7c3e5b6a98dff36b10e34c1b20773850c587e2b2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hui-Zhu/virtio-mem-Add-support-of-memory_hotplug-memmap_on_memory/20210623-200242
        git checkout 7c3e5b6a98dff36b10e34c1b20773850c587e2b2
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "mhp_supports_memmap_on_memory" [drivers/virtio/virtio_mem.ko] undefined!
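
(This modpost error means the symbol is not exported for module use; a
likely fix, assuming the function lives in mm/memory_hotplug.c as it
does in mainline at this time, would be something along the lines of

	EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);

next to its definition, or restricting the feature to a built-in
virtio-mem.)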

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 44806 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory
  2021-06-23 11:58 [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory Hui Zhu
  2021-06-23 15:04 ` David Hildenbrand
  2021-06-23 18:59 ` kernel test robot
@ 2021-06-25  3:45 ` kernel test robot
  2021-07-07 15:17   ` David Hildenbrand
  3 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2021-06-25  3:45 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1564 bytes --]

Hi Hui,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[also build test ERROR on v5.13-rc7 next-20210624]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Hui-Zhu/virtio-mem-Add-support-of-memory_hotplug-memmap_on_memory/20210623-200242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 0c18f29aae7ce3dadd26d8ee3505d07cc982df75
config: x86_64-rhel-8.3-kselftests (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7c3e5b6a98dff36b10e34c1b20773850c587e2b2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hui-Zhu/virtio-mem-Add-support-of-memory_hotplug-memmap_on_memory/20210623-200242
        git checkout 7c3e5b6a98dff36b10e34c1b20773850c587e2b2
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "mhp_supports_memmap_on_memory" [drivers/virtio/virtio_mem.ko] undefined!

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 41894 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory
  2021-06-23 11:58 [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory Hui Zhu
@ 2021-07-07 15:17   ` David Hildenbrand
  2021-06-23 18:59 ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand @ 2021-07-07 15:17 UTC (permalink / raw)
  To: Hui Zhu, mst, jasowang, virtualization, linux-kernel; +Cc: Hui Zhu

On 23.06.21 13:58, Hui Zhu wrote:
> From: Hui Zhu <teawater@antfin.com>
> 

Sorry for the delay, once I was back from vacation my inbox was 
overflowing :)

> We did some virtio-mem resize tests in a high-memory-pressure environment.
> Memory increases slowly and sometimes fails in these tests.
> Here is a way to reproduce the issue:
> Start a QEMU guest with a small amount of memory (132 MiB) and resize the
> virtio-mem device to hotplug memory.
> The following error is then reported:
> [    8.097461] virtio_mem virtio0: requested size: 0x10000000
> [    8.098038] virtio_mem virtio0: plugging memory: 0x100000000 -
> 0x107ffffff
> [    8.098829] virtio_mem virtio0: adding memory: 0x100000000 -
> 0x107ffffff
> [    8.106298] kworker/0:1: vmemmap alloc failure: order:9,
> mode:0x4cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL),
> nodemask=(null),cpuset=/,mems_allowed=0
> [    8.107609] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 5.13.0-rc7+
> [    8.108295] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [    8.109476] Workqueue: events_freezable virtio_mem_run_wq
> [    8.110039] Call Trace:
> [    8.110305]  dump_stack+0x76/0x94
> [    8.110654]  warn_alloc.cold+0x7b/0xdf
> [    8.111054]  ? __alloc_pages+0x2c2/0x310
> [    8.111462]  vmemmap_alloc_block+0x86/0xdc
> [    8.111891]  vmemmap_populate+0xfc/0x325
> [    8.112309]  __populate_section_memmap+0x38/0x4e
> [    8.112787]  sparse_add_section+0x167/0x244
> [    8.113226]  __add_pages+0xa6/0x130
> [    8.113592]  add_pages+0x12/0x60
> [    8.113934]  add_memory_resource+0x114/0x2d0
> [    8.114377]  add_memory_driver_managed+0x7c/0xc0
> [    8.114852]  virtio_mem_add_memory+0x57/0xe0
> [    8.115304]  virtio_mem_sbm_plug_and_add_mb+0x9a/0x130
> [    8.115833]  virtio_mem_run_wq+0x9d5/0x1100
> I think allocating 2 MiB of contiguous memory can be slow and may fail
> in some cases, especially in a high-memory-pressure environment.

So, interestingly, what failed was the 2 MB allocation. But
vmemmap_populate() on x86-64 will actually fall back to individual 4k
allocations. See arch/x86/mm/init_64.c:vmemmap_populate_hugepages(),
which falls back to vmemmap_populate_basepages().
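
A heavily simplified sketch of that fallback path (not the actual code
from arch/x86/mm/init_64.c; the page-table handling and existing-mapping
checks are elided), just to illustrate the flow:

	for (addr = start; addr < end; addr = next) {
		next = pmd_addr_end(addr, end);

		/* Try one PMD-sized (2 MiB) block for the vmemmap first. */
		p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
		if (p) {
			/* ... install it as a huge PMD mapping (elided) ... */
			continue;
		}

		/* Huge allocation failed: populate this range with 4k pages. */
		if (vmemmap_populate_basepages(addr, next, node, NULL))
			return -ENOMEM;
	}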

We could certainly think about silencing this warning (GFP_NOWARN) and 
instead printing a clearer warning like "vmemmap: populating huge page 
failed; falling back to populating base pages"

Did you also see a "virtio_mem virtio0: adding memory failed: .." device 
warning? If not, adding memory did succeed! Although it's suboptimal.


> This commit tries to add support for memory_hotplug.memmap_on_memory to
> handle this issue.
> 
> Only SBM mode supports it for now because memory_hotplug.memmap_on_memory
> requires a single memory block.
> 
> Add nr_vmemmap_pages and sbs_vmemmap to struct sbm.
> If memory_hotplug.memmap_on_memory is enabled, the number of pages used
> for a memory block's internal metadata is stored in nr_vmemmap_pages.
> sbs_vmemmap is the number of vmemmap subblocks per Linux memory block.
> The pages in the vmemmap subblocks may exceed nr_vmemmap_pages because
> sb_size needs to span at least MAX_ORDER_NR_PAGES and
> pageblock_nr_pages pages (virtio_mem_init).
> None of the pages in the vmemmap subblocks are added to the buddy, not
> even the pages that are not used to store the internal metadata
> (struct pages), because they would not work reliably with
> alloc_contig_range().
> 
> When resizing virtio-mem, sbs_vmemmap is accounted for in
> virtio_mem_sbm_plug_and_add_mb, virtio_mem_sbm_unplug_any_sb_offline
> and virtio_mem_sbm_unplug_any_sb_online, because the internal metadata
> also needs real pages on the host to back it.  I think keeping the
> virtio-mem size equal to the actual memory footprint on the host is
> better if we want to set up a memory cgroup for QEMU.

So the main thing I dislike about the current RFC state is the waste of 
memory that we signal to the device as being used, but that we are not 
actually able to use. The remaining stuff looks reasonably clear to me, 
although we could do some refactorings to make it read a bit nicer.

The "loss" currently corresponds to 2 MiB for a 128MB memory block -- 
that's 1/64 of all added memory and just as much as the vmemmap itself!

I don't see an easy way around this when keeping the subblock size at 2 
MiB. We'd have to mess with "partial subblocks" when processing (un)plug 
requests, which is just plain ugly.

There is a big TODO item on my todo list to teach alloc_contig_range() 
to handle pageblock_order properly and I shall start working on that 
soonish. [1]

What I think we should do is


1. Implement the "memory_hotplug.memmap_on_memory" optimization for big 
block mode with a single memory block only for now (Linux also only 
supports this). Environments that desperately need this optimization can 
set "force_bbm=1" when loading the virtio-mem module.

It already has the correct flow -- virtio_mem_bbm_plug_and_add_bb() and 
virtio_mem_bbm_remove_and_unplug_bb(). We'll have to teach:

* virtio_mem_bbm_notify_going_offline() /
   virtio_mem_bbm_notify_cancel_offline() about the special vmemmap
   ranges
* virtio_mem_bbm_offline_remove_and_unplug_bb() about the special
   vmemmap ranges

I think that should be mostly it, but there might be more.

2. Optimize alloc_contig_range() such that we can drop the MAX_ORDER - 1 
requirement. With most devices, we can then use a subblock size of 
pageblock_order (2MiB on x86-64).

3. Pick up a reduced variant of your RFC that implements the 
optimization only if the sbm.vmemmap pages span complete subblocks. That 
removes the need for sbm.nr_vmemmap_pages and consequently results in no 
loss of plugged memory.

[1] 
https://lkml.kernel.org/r/c8e21ac4-ace7-3176-8782-535bd6590583@redhat.com


> @@ -1534,12 +1615,15 @@ static int virtio_mem_sbm_prepare_next_mb(struct virtio_mem *vm,
>   static int virtio_mem_sbm_plug_and_add_mb(struct virtio_mem *vm,
>   					  unsigned long mb_id, uint64_t *nb_sb)
>   {
> -	const int count = min_t(int, *nb_sb, vm->sbm.sbs_per_mb);
> +	int count = min_t(int, *nb_sb, vm->sbm.sbs_per_mb);
>   	int rc;
>   
>   	if (WARN_ON_ONCE(!count))
>   		return -EINVAL;
>   
> +	if (vm->sbm.sbs_vmemmap)
> +		count = max_t(int, count, vm->sbm.sbs_vmemmap);
> +

If you exceed the original *nb_sb (which can happen on bigger memory 
blocks, like with 2 GiB, where the vmemmap actually spans multiple 
subblocks), you can end up plugging more than requested from the 
hypervisor.

You'd instead have to return with -ENOSPC in case *nb_sb does not at 
least span vm->sbm.sbs_vmemmap.
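
Roughly something like this, as an untested sketch of that check
(assuming it goes right after the existing WARN_ON_ONCE(!count) check):

	/* The vmemmap subblocks alone would exceed the request. */
	if (vm->sbm.sbs_vmemmap && *nb_sb < vm->sbm.sbs_vmemmap)
		return -ENOSPC;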

[...]

>   		/*
>   		 * Remove the block from Linux - this should never fail.
>   		 * Hinder the block from getting onlined by marking it
> @@ -1840,6 +1934,23 @@ static int virtio_mem_sbm_unplug_any_sb_offline(struct virtio_mem *vm,
>   		rc = virtio_mem_sbm_remove_mb(vm, mb_id);
>   		BUG_ON(rc);
>   		mutex_lock(&vm->hotplug_mutex);
> +
> +		/* Remove vmemmap pages. */
> +		if (vm->sbm.sbs_vmemmap) {
> +			rc = virtio_mem_sbm_unplug_sb(vm, mb_id, 0,
> +						      vm->sbm.sbs_vmemmap);
> +			/*
> +			 * Just warn because this error will
> +			 * not affect next plug.
> +			 */
> +			WARN_ON(rc);

We should avoid WARN_ON() and instead use dev_warn() for cases that can 
actually happen. With panic_on_warn, which some distributions enable, 
a WARN_ON() can crash the kernel.

I think error handling might also not be correct. We'd actually want to 
set the state to something like VIRTIO_MEM_SBM_MB_PLUGGED, to retry 
unplug later. Further, we'd want to return to the main loop to retry 
fixing this up.

Because if you leave the vmemmap blocks plugged, 
virtio_mem_sbm_plug_and_add_mb() will later keep failing with invalid 
requests from the hypervisor I guess.
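
A rough, untested sketch of that handling (an illustration of the idea,
not a final implementation):

	/* Remove the vmemmap subblocks as well. */
	if (vm->sbm.sbs_vmemmap) {
		rc = virtio_mem_sbm_unplug_sb(vm, mb_id, 0,
					      vm->sbm.sbs_vmemmap);
		if (rc) {
			dev_warn(&vm->vdev->dev,
				 "unplugging vmemmap subblocks failed: %d\n",
				 rc);
			/*
			 * Keep the block accounted as plugged so the main
			 * loop can retry unplugging it later.
			 */
			virtio_mem_sbm_set_mb_state(vm, mb_id,
						    VIRTIO_MEM_SBM_MB_PLUGGED);
			return rc;
		}
		*nb_sb -= min_t(uint64_t, *nb_sb, vm->sbm.sbs_vmemmap);
	}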

[...]

> -	if (virtio_mem_sbm_test_sb_unplugged(vm, mb_id, 0, vm->sbm.sbs_per_mb)) {
> +	if (virtio_mem_sbm_test_sb_unplugged(vm, mb_id, vm->sbm.sbs_vmemmap,
> +				vm->sbm.sbs_per_mb - vm->sbm.sbs_vmemmap)) {
>   		mutex_unlock(&vm->hotplug_mutex);
>   		rc = virtio_mem_sbm_offline_and_remove_mb(vm, mb_id);
>   		mutex_lock(&vm->hotplug_mutex);
> -		if (!rc)
> +		if (!rc) {
>   			virtio_mem_sbm_set_mb_state(vm, mb_id,
>   						    VIRTIO_MEM_SBM_MB_UNUSED);
> +			/* Remove vmemmap pages. */
> +			if (vm->sbm.sbs_vmemmap) {
> +				rc = virtio_mem_sbm_unplug_sb(vm, mb_id, 0,
> +							vm->sbm.sbs_vmemmap);
> +				/*
> +				 * Just warn because this error will
> +				 * not affect next plug.
> +				 */
> +				WARN_ON(rc);
> +				if (!rc) {
> +					if (*nb_sb >= vm->sbm.sbs_vmemmap)
> +						*nb_sb -= vm->sbm.sbs_vmemmap;
> +					else
> +						*nb_sb = 0;
> +				}
> +			}
> +		}

Same comments as for virtio_mem_sbm_unplug_any_sb_offline().

>   	}
>   
>   	return 0;
> @@ -2444,6 +2578,15 @@ static int virtio_mem_init(struct virtio_mem *vm)
>   		       memory_block_size_bytes() - 1;
>   		vm->sbm.first_mb_id = virtio_mem_phys_to_mb_id(addr);
>   		vm->sbm.next_mb_id = vm->sbm.first_mb_id;
> +		if (mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
> +			vm->sbm.nr_vmemmap_pages
> +				= PFN_DOWN(PFN_DOWN(memory_block_size_bytes()) *
> +					   sizeof(struct page));

I feel like we want the memory hotplug / vmemmap code to tell us this 
instead of calculating it ourselves.
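
For illustration only, a hypothetical helper (it does not exist in
mainline as of this thread; name and placement are made up):

	/* mm/memory_hotplug.c: pages needed for the memmap of a range. */
	unsigned long mhp_nr_vmemmap_pages(unsigned long nr_pages)
	{
		return PFN_UP(nr_pages * sizeof(struct page));
	}

virtio-mem could then query this instead of open-coding the calculation.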



-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 6+ messages in thread


end of thread, other threads:[~2021-07-07 15:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-23 11:58 [RFC] virtio-mem: Add support of memory_hotplug.memmap_on_memory Hui Zhu
2021-06-23 15:04 ` David Hildenbrand
2021-06-23 18:59 ` kernel test robot
2021-06-25  3:45 ` kernel test robot
2021-07-07 15:17 ` David Hildenbrand
