* [RFC][Patch v10 0/2] mm: Support for page hinting
@ 2019-06-03 17:03 Nitesh Narayan Lal
  2019-06-03 17:03 ` [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-03 17:03 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck

This patch series proposes an efficient mechanism for communicating free memory
from a guest to its hypervisor. It especially enables guests with no page cache
(e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
rapidly hand back free memory to the hypervisor.
This approach has a minimal impact on the existing core-mm infrastructure.

Measurement results (measurement details appended to this email):
* With active page hinting, 3 more guests, each with 5 GB of memory (5 vs. 2
in total), could be launched on a 15 GB (single NUMA) system without swapping.
* With active page hinting, on a system with 15 GB of (single NUMA) memory and
4 GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
in the last invocation needing only 37s, compared to 3m35s without page hinting.

This approach tracks all freed pages of order MAX_ORDER - 2 and higher in
bitmaps. A new hook after buddy merging is used to set the bits in the
bitmap. Currently, the bits are only cleared when pages are hinted, not when
pages are re-allocated.
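
To make the granularity concrete, here is a small stand-alone sketch of the
PFN-to-bit mapping (the numbers are assumptions for illustration only: the
common x86_64 configuration of MAX_ORDER = 11 and 4 KiB base pages; the macro
name matches the one introduced in patch 1):

    #include <stdio.h>

    #define MAX_ORDER              11               /* assumed x86_64 default */
    #define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2)  /* 9, as in patch 1 */
    #define PAGE_SHIFT             12               /* 4 KiB base pages */

    int main(void)
    {
            unsigned long base_pfn = 0x100000;      /* example zone start PFN */
            unsigned long pfn = 0x100a00;           /* example freed PFN */

            /* Each bit covers one order-9 chunk: 512 pages == 2 MiB (THP size). */
            unsigned long chunk_kib =
                    (1UL << (PAGE_HINTING_MIN_ORDER + PAGE_SHIFT)) >> 10;
            unsigned long bitnr = (pfn - base_pfn) >> PAGE_HINTING_MIN_ORDER;

            /* Prints: chunk size: 2048 KiB, bit index: 5 */
            printf("chunk size: %lu KiB, bit index: %lu\n", chunk_kib, bitnr);
            return 0;
    }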

Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
threshold is met, trying to isolate and report pages that are still free.
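
As a worked example of the threshold (using the defaults that patch 2's
virtio-balloon backend registers): with max_pages = 16 and chunks of order
"MAX_ORDER - 2" (2 MiB each on x86_64), the workqueue is scheduled once any
zone has accumulated at least 16 tracked free chunks, i.e., roughly 32 MiB
of potentially free memory.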

The isolated pages are reported via virtio-balloon, which is responsible for
sending batched pages to the host synchronously. Once the hypervisor has
processed the hinting request, the isolated pages are returned to the buddy.

The key changes made in this series compared to v9[1] are:
* Pages are reported to the hypervisor only in chunks of "MAX_ORDER - 2" so
as not to break up THP.
* Only a set of 16 pages can be isolated and reported to the host at a time,
to avoid any false OOMs.
* page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
on virtio and not on KVM itself. This would enable any other hypervisor to use
this feature by implementing virtio devices.
* The sysctl variable is replaced with a virtio-balloon parameter to
enable/disable page-hinting.

Pending items:
* Test guests with assigned devices to ensure that hinting doesn't break
device assignment.
* Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
* Compare reporting free pages via vring with vhost.
* Decide between MADV_DONTNEED and MADV_FREE.
* Look into memory hotplug, more efficient locking, possible races when
disabling.
* Come up with proper/traceable error messages/logs.
* Minor reworks and simplifications (e.g., virtio protocol).

Benefit analysis:
1. Use-case - Number of guests that can be launched without swap usage
NUMA Nodes = 1 with 15 GB memory
Guest Memory = 5 GB
Number of cores in guest = 1
Workload = a test allocation program that allocates 4 GB of memory, touches
it via memset, and exits (a minimal sketch follows the procedure below).
Procedure =
The first guest is launched and, once its console is up, the test allocation
program is executed with a 4 GB memory request (due to this, the guest
occupies almost 4-5 GB of memory in the host on a system without page
hinting). Once this program exits, another guest is launched in the host and
the same process is followed. This is repeated until the swap space starts
being used.
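
A minimal sketch of the kind of test allocation program described above
(illustrative only; the exact tool used is not part of this series):

    #include <stdlib.h>
    #include <string.h>

    #define ALLOC_SIZE (4UL << 30) /* 4 GB */

    int main(void)
    {
            char *buf = malloc(ALLOC_SIZE);

            if (!buf)
                    return 1;
            /* Touch every byte so that all pages are actually faulted in. */
            memset(buf, 1, ALLOC_SIZE);
            free(buf);
            return 0;
    }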

Results:
Without hinting = 3 guests, swap usage at the end: 1.1 GB.
With hinting = 5 guests, swap usage at the end: 0.

2. Use-case - memhog execution time
Guest Memory = 6GB
Number of cores = 4
NUMA Nodes = 1 with 15 GB memory
Procedure: 3 guests are launched one after the other, and the 'memhog 6G'
execution time is measured in each of them.
Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0

Performance analysis:
1. will-it-scale's page_fault1:
Guest Memory = 6GB
Number of cores = 24

Without Hinting:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,315890,95.82,317633,95.83,317633
2,570810,91.67,531147,91.94,635266
3,826491,87.54,713545,88.53,952899
4,1087434,83.40,901215,85.30,1270532
5,1277137,79.26,916442,83.74,1588165
6,1503611,75.12,1113832,79.89,1905798
7,1683750,70.99,1140629,78.33,2223431
8,1893105,66.85,1157028,77.40,2541064
9,2046516,62.50,1179445,76.48,2858697
10,2291171,58.57,1209247,74.99,3176330
11,2486198,54.47,1217265,75.13,3493963
12,2656533,50.36,1193392,74.42,3811596
13,2747951,46.21,1185540,73.45,4129229
14,2965757,42.09,1161862,72.20,4446862
15,3049128,37.97,1185923,72.12,4764495
16,3150692,33.83,1163789,70.70,5082128
17,3206023,29.70,1174217,70.11,5399761
18,3211380,25.62,1179660,69.40,5717394
19,3202031,21.44,1181259,67.28,6035027
20,3218245,17.35,1196367,66.75,6352660
21,3228576,13.26,1129561,66.74,6670293
22,3207452,9.15,1166517,66.47,6987926
23,3153800,5.09,1172877,61.57,7305559
24,3184542,0.99,1186244,58.36,7623192

With Hinting:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,306737,95.82,305130,95.78,306737
2,573207,91.68,530453,91.92,613474
3,810319,87.53,695281,88.58,920211
4,1074116,83.40,880602,85.48,1226948
5,1308283,79.26,1109257,81.23,1533685
6,1501987,75.12,1093661,80.19,1840422
7,1695300,70.99,1104207,79.03,2147159
8,1901523,66.85,1193613,76.90,2453896
9,2051288,62.73,1200913,76.22,2760633
10,2275771,58.60,1192992,75.66,3067370
11,2435016,54.48,1191472,74.66,3374107
12,2623114,50.35,1196911,74.02,3680844
13,2766071,46.22,1178589,73.02,3987581
14,2932163,42.10,1166414,72.96,4294318
15,3000853,37.96,1177177,72.62,4601055
16,3113738,33.85,1165444,70.54,4907792
17,3132135,29.77,1165055,68.51,5214529
18,3175121,25.69,1166969,69.27,5521266
19,3205490,21.61,1159310,65.65,5828003
20,3220855,17.52,1171827,62.04,6134740
21,3182568,13.48,1138918,65.05,6441477
22,3130543,9.30,1128185,60.60,6748214
23,3087426,5.15,1127912,55.36,7054951
24,3099457,1.04,1176100,54.96,7361688

[1] https://lkml.org/lkml/2019/3/6/413




* [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-03 17:03 [RFC][Patch v10 0/2] mm: Support for page hinting Nitesh Narayan Lal
@ 2019-06-03 17:03 ` Nitesh Narayan Lal
  2019-06-03 19:04   ` Alexander Duyck
                     ` (2 more replies)
  2019-06-03 17:03 ` [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-03 17:03 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck

This patch introduces the core infrastructure for free page hinting in
virtual environments. It enables the kernel to track free pages which
can be reported to its hypervisor, so that the hypervisor can free and
reuse that memory as per its requirements.

While the pages are being processed by the hypervisor (e.g.,
via MADV_FREE), the guest must not use them; otherwise, data loss
would be possible. To avoid such a situation, these pages are
temporarily removed from the buddy. The number of pages temporarily
removed from the buddy is governed by the backend (virtio-balloon
in our case).

To efficiently identify free pages that can be hinted to the
hypervisor, coarse-grained bitmaps are used. Only fairly big
chunks - "MAX_ORDER - 2" on x86 - are reported to the hypervisor,
especially to avoid breaking up THP in the hypervisor and to save
space. The bits in the bitmap indicate that a page *might* be free,
not a guarantee. A new hook after buddy merging sets the bits.

Bitmaps are stored per zone and protected by the zone lock. A workqueue
asynchronously processes the bitmaps, trying to isolate and report pages
that are still free. The backend (virtio-balloon) is responsible for
reporting these batched pages to the host synchronously. Once reporting/
freeing is complete, the isolated pages are returned to the buddy.
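
For reference, a sketch of how a backend is expected to hook into this
infrastructure (the example_* callbacks are hypothetical placeholders;
patch 2 provides the real virtio-balloon implementation):

    /* Hypothetical backend callbacks. */
    static int example_prepare(void);                    /* allocate report array */
    static void example_hint_pages(struct list_head *l); /* report to the host */
    static void example_cleanup(void);                   /* free report array */

    static const struct page_hinting_cb example_cb = {
            .prepare    = example_prepare,
            .hint_pages = example_hint_pages,
            .cleanup    = example_cleanup,
            .max_pages  = 16, /* pages isolated/reported per batch */
    };

    /* Once the (virtio) feature has been negotiated: */
    page_hinting_enable(&example_cb);
    /* ... and on teardown: */
    page_hinting_disable();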

There are still various things to look into (e.g., memory hotplug, more
efficient locking, possible races when disabling).

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/virtio/Kconfig       |   1 +
 include/linux/page_hinting.h |  46 +++++++
 mm/Kconfig                   |   6 +
 mm/Makefile                  |   2 +
 mm/page_alloc.c              |  17 +--
 mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
 6 files changed, 301 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 mm/page_hinting.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 35897649c24f..5a96b7a2ed1e 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -46,6 +46,7 @@ config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
 	select MEMORY_BALLOON
+	select PAGE_HINTING
 	---help---
 	 This driver supports increasing and decreasing the amount
 	 of memory within a KVM guest.
diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
new file mode 100644
index 000000000000..e65188fe1e6b
--- /dev/null
+++ b/include/linux/page_hinting.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_HINTING_H
+#define _LINUX_PAGE_HINTING_H
+
+/*
+ * Minimum page order required for a page to be hinted to the host.
+ */
+#define PAGE_HINTING_MIN_ORDER		(MAX_ORDER - 2)
+
+/*
+ * struct page_hinting_cb: holds the callbacks to store, report and cleanup
+ * isolated pages.
+ * @prepare:		Callback responsible for allocating an array to hold
+ *			the isolated pages.
+ * @hint_pages:		Callback which reports the isolated pages synchronously
+ *			to the host.
+ * @cleanup:		Callback to free the array used for reporting the
+ *			isolated pages.
+ * @max_pages:		Maximum pages that are going to be hinted to the host
+ *			at a time of granularity >= PAGE_HINTING_MIN_ORDER.
+ */
+struct page_hinting_cb {
+	int (*prepare)(void);
+	void (*hint_pages)(struct list_head *list);
+	void (*cleanup)(void);
+	int max_pages;
+};
+
+#ifdef CONFIG_PAGE_HINTING
+void page_hinting_enqueue(struct page *page, int order);
+void page_hinting_enable(const struct page_hinting_cb *cb);
+void page_hinting_disable(void);
+#else
+static inline void page_hinting_enqueue(struct page *page, int order)
+{
+}
+
+static inline void page_hinting_enable(const struct page_hinting_cb *cb)
+{
+}
+
+static inline void page_hinting_disable(void)
+{
+}
+#endif
+#endif /* _LINUX_PAGE_HINTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ee8d1f311858..177d858de758 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -764,4 +764,10 @@ config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+# PAGE_HINTING allows the guest to report its free pages to the
+# host at regular intervals.
+config PAGE_HINTING
+       bool
+       def_bool n
+       depends on X86_64
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ac5e5ba78874..bec456dfee34 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -41,6 +41,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
+
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
@@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)	+= z3fold.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b13d3914176..d12f69e0e402 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/page_hinting.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
  * -- nyc
  */
 
-static inline void __free_one_page(struct page *page,
+inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool hint)
 {
 	unsigned long combined_pfn;
 	unsigned long uninitialized_var(buddy_pfn);
@@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_page_order(page, order);
 
+	if (hint)
+		page_hinting_enqueue(page, order);
 	/*
 	 * If this is not the largest possible page, check if the buddy
 	 * of the next-highest order is free. If it is, it's possible
@@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 static void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
 				unsigned int order,
-				int migratetype)
+				int migratetype, bool hint)
 {
 	spin_lock(&zone->lock);
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, hint);
 	spin_unlock(&zone->lock);
 }
 
@@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
 	local_irq_restore(flags);
 }
 
@@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype, true);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
diff --git a/mm/page_hinting.c b/mm/page_hinting.c
new file mode 100644
index 000000000000..7341c6462de2
--- /dev/null
+++ b/mm/page_hinting.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Page hinting support to enable a VM to report the freed pages back
+ * to the host.
+ *
+ * Copyright Red Hat, Inc. 2019
+ *
+ * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/page_hinting.h>
+#include <linux/kvm_host.h>
+
+/*
+ * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
+ * and other required parameters which could help in retrieving the original
+ * PFN value using the bitmap.
+ * @bitmap:		Pointer to the bitmap of free PFNs.
+ * @base_pfn:		Starting PFN value for the zone whose bitmap is stored.
+ * @free_pages:		Tracks the number of free pages of granularity
+ *			PAGE_HINTING_MIN_ORDER.
+ * @nbits:		Indicates the total size of the bitmap in bits allocated
+ *			at the time of initialization.
+ */
+struct hinting_bitmap {
+	unsigned long *bitmap;
+	unsigned long base_pfn;
+	atomic_t free_pages;
+	unsigned long nbits;
+} bm_zone[MAX_NR_ZONES];
+
+static void init_hinting_wq(struct work_struct *work);
+extern int __isolate_free_page(struct page *page, unsigned int order);
+extern void __free_one_page(struct page *page, unsigned long pfn,
+			    struct zone *zone, unsigned int order,
+			    int migratetype, bool hint);
+const struct page_hinting_cb *hcb;
+struct work_struct hinting_work;
+
+static unsigned long find_bitmap_size(struct zone *zone)
+{
+	unsigned long nbits = ALIGN(zone->spanned_pages,
+			    PAGE_HINTING_MIN_ORDER);
+
+	nbits = nbits >> PAGE_HINTING_MIN_ORDER;
+	return nbits;
+}
+
+void page_hinting_enable(const struct page_hinting_cb *callback)
+{
+	struct zone *zone;
+	int idx = 0;
+	unsigned long bitmap_size = 0;
+
+	for_each_populated_zone(zone) {
+		spin_lock(&zone->lock);
+		bitmap_size = find_bitmap_size(zone);
+		bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
+		if (!bm_zone[idx].bitmap)
+			return;
+		bm_zone[idx].nbits = bitmap_size;
+		bm_zone[idx].base_pfn = zone->zone_start_pfn;
+		spin_unlock(&zone->lock);
+		idx++;
+	}
+	hcb = callback;
+	INIT_WORK(&hinting_work, init_hinting_wq);
+}
+EXPORT_SYMBOL_GPL(page_hinting_enable);
+
+void page_hinting_disable(void)
+{
+	struct zone *zone;
+	int idx = 0;
+
+	cancel_work_sync(&hinting_work);
+	hcb = NULL;
+	for_each_populated_zone(zone) {
+		spin_lock(&zone->lock);
+		bitmap_free(bm_zone[idx].bitmap);
+		bm_zone[idx].base_pfn = 0;
+		bm_zone[idx].nbits = 0;
+		atomic_set(&bm_zone[idx].free_pages, 0);
+		spin_unlock(&zone->lock);
+		idx++;
+	}
+}
+EXPORT_SYMBOL_GPL(page_hinting_disable);
+
+static unsigned long pfn_to_bit(struct page *page, int zonenum)
+{
+	unsigned long bitnr;
+
+	bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
+			 >> PAGE_HINTING_MIN_ORDER;
+	return bitnr;
+}
+
+static void release_buddy_pages(struct list_head *pages)
+{
+	int mt = 0, zonenum, order;
+	struct page *page, *next;
+	struct zone *zone;
+	unsigned long bitnr;
+
+	list_for_each_entry_safe(page, next, pages, lru) {
+		zonenum = page_zonenum(page);
+		zone = page_zone(page);
+		bitnr = pfn_to_bit(page, zonenum);
+		spin_lock(&zone->lock);
+		list_del(&page->lru);
+		order = page_private(page);
+		set_page_private(page, 0);
+		mt = get_pageblock_migratetype(page);
+		__free_one_page(page, page_to_pfn(page), zone,
+				order, mt, false);
+		spin_unlock(&zone->lock);
+	}
+}
+
+static void bm_set_pfn(struct page *page)
+{
+	unsigned long bitnr = 0;
+	int zonenum = page_zonenum(page);
+	struct zone *zone = page_zone(page);
+
+	lockdep_assert_held(&zone->lock);
+	bitnr = pfn_to_bit(page, zonenum);
+	if (bm_zone[zonenum].bitmap &&
+	    bitnr < bm_zone[zonenum].nbits &&
+	    !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
+		atomic_inc(&bm_zone[zonenum].free_pages);
+}
+
+static void scan_hinting_bitmap(int zonenum, int free_pages)
+{
+	unsigned long set_bit, start = 0;
+	struct page *page;
+	struct zone *zone;
+	int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
+	LIST_HEAD(isolated_pages);
+
+	ret = hcb->prepare();
+	if (ret < 0)
+		return;
+	for (;;) {
+		ret = 0;
+		set_bit = find_next_bit(bm_zone[zonenum].bitmap,
+					bm_zone[zonenum].nbits, start);
+		if (set_bit >= bm_zone[zonenum].nbits)
+			break;
+		page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
+				bm_zone[zonenum].base_pfn);
+		if (!page)
+			continue;
+		zone = page_zone(page);
+		spin_lock(&zone->lock);
+
+		if (PageBuddy(page) && page_private(page) >=
+		    PAGE_HINTING_MIN_ORDER) {
+			order = page_private(page);
+			ret = __isolate_free_page(page, order);
+		}
+		clear_bit(set_bit, bm_zone[zonenum].bitmap);
+		spin_unlock(&zone->lock);
+		if (ret) {
+			/*
+			 * restoring page order to use it while releasing
+			 * the pages back to the buddy.
+			 */
+			set_page_private(page, order);
+			list_add_tail(&page->lru, &isolated_pages);
+			isolated_cnt++;
+			if (isolated_cnt == hcb->max_pages) {
+				hcb->hint_pages(&isolated_pages);
+				release_buddy_pages(&isolated_pages);
+				isolated_cnt = 0;
+			}
+		}
+		start = set_bit + 1;
+		scanned_pages++;
+	}
+	if (isolated_cnt) {
+		hcb->hint_pages(&isolated_pages);
+		release_buddy_pages(&isolated_pages);
+	}
+	hcb->cleanup();
+	if (scanned_pages > free_pages)
+		atomic_sub((scanned_pages - free_pages),
+			   &bm_zone[zonenum].free_pages);
+}
+
+static bool check_hinting_threshold(void)
+{
+	int zonenum = 0;
+
+	for (; zonenum < MAX_NR_ZONES; zonenum++) {
+		if (atomic_read(&bm_zone[zonenum].free_pages) >=
+				hcb->max_pages)
+			return true;
+	}
+	return false;
+}
+
+static void init_hinting_wq(struct work_struct *work)
+{
+	int zonenum = 0, free_pages = 0;
+
+	for (; zonenum < MAX_NR_ZONES; zonenum++) {
+		free_pages = atomic_read(&bm_zone[zonenum].free_pages);
+		if (free_pages >= hcb->max_pages) {
+			/* Find a better way to synchronize per zone
+			 * free_pages.
+			 */
+			atomic_sub(free_pages,
+				   &bm_zone[zonenum].free_pages);
+			scan_hinting_bitmap(zonenum, free_pages);
+		}
+	}
+}
+
+void page_hinting_enqueue(struct page *page, int order)
+{
+	if (hcb && order >= PAGE_HINTING_MIN_ORDER)
+		bm_set_pfn(page);
+	else
+		return;
+
+	if (check_hinting_threshold()) {
+		int cpu = smp_processor_id();
+
+		queue_work_on(cpu, system_wq, &hinting_work);
+	}
+}
-- 
2.21.0



* [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-06-03 17:03 [RFC][Patch v10 0/2] mm: Support for page hinting Nitesh Narayan Lal
  2019-06-03 17:03 ` [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
@ 2019-06-03 17:03 ` Nitesh Narayan Lal
  2019-06-03 22:38   ` Alexander Duyck
  2019-06-04 16:33   ` Alexander Duyck
  2019-06-03 17:04 ` [QEMU PATCH] KVM: Support for page hinting Nitesh Narayan Lal
  2019-06-03 18:04 ` [RFC][Patch v10 0/2] mm: " Michael S. Tsirkin
  3 siblings, 2 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-03 17:03 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck

Enables the kernel to negotiate the VIRTIO_BALLOON_F_HINTING feature with
the host. If it is available and page_hinting_flag is set to true, page
hinting is enabled and its callbacks are configured along with the
max_pages count, which indicates the maximum number of pages that can be
isolated and hinted at a time. Currently, only free pages of order >=
(MAX_ORDER - 2) are reported. To prevent any false OOMs, the max_pages
count is set to 16.

By default, the page hinting feature is enabled and initialized as soon as
the virtio-balloon driver is loaded. However, it can be disabled via the
page_hinting_flag module parameter of virtio-balloon (settable at module
load time, as shown below).
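
For instance (an illustrative invocation, assuming virtio-balloon is built
as a module), hinting can be turned off at load time with:

    modprobe virtio_balloon page_hinting_flag=0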

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 111 ++++++++++++++++++++++++++-
 include/uapi/linux/virtio_balloon.h |  14 ++++
 2 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f19061b585a4..40f09ea31643 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -31,6 +31,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/page_hinting.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -48,6 +49,7 @@
 /* The size of a free page block in bytes */
 #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
 	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
+#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES	16
 
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
@@ -58,6 +60,7 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_DEFLATE,
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_HINTING,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -67,7 +70,8 @@ enum virtio_balloon_config_read {
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
+			 *hinting_vq;
 
 	/* Balloon's own wq for cpu-intensive work items */
 	struct workqueue_struct *balloon_wq;
@@ -125,6 +129,9 @@ struct virtio_balloon {
 
 	/* To register a shrinker to shrink memory upon memory pressure */
 	struct shrinker shrinker;
+
+	/* object pointing at the array of isolated pages ready for hinting */
+	struct hinting_data *hinting_arr;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -132,6 +139,85 @@ static struct virtio_device_id id_table[] = {
 	{ 0 },
 };
 
+#ifdef CONFIG_PAGE_HINTING
+struct virtio_balloon *hvb;
+bool page_hinting_flag = true;
+module_param(page_hinting_flag, bool, 0444);
+MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
+
+static bool virtqueue_kick_sync(struct virtqueue *vq)
+{
+	u32 len;
+
+	if (likely(virtqueue_kick(vq))) {
+		while (!virtqueue_get_buf(vq, &len) &&
+		       !virtqueue_is_broken(vq))
+			cpu_relax();
+		return true;
+	}
+	return false;
+}
+
+static void page_hinting_report(int entries)
+{
+	struct scatterlist sg;
+	struct virtqueue *vq = hvb->hinting_vq;
+	int err = 0;
+	struct hinting_data *hint_req;
+	u64 gpaddr;
+
+	hint_req = kmalloc(sizeof(*hint_req), GFP_KERNEL);
+	if (!hint_req)
+		return;
+	gpaddr = virt_to_phys(hvb->hinting_arr);
+	hint_req->phys_addr = cpu_to_virtio64(hvb->vdev, gpaddr);
+	hint_req->size = cpu_to_virtio32(hvb->vdev, entries);
+	sg_init_one(&sg, hint_req, sizeof(*hint_req));
+	err = virtqueue_add_outbuf(vq, &sg, 1, hint_req, GFP_KERNEL);
+	if (!err)
+		virtqueue_kick_sync(hvb->hinting_vq);
+
+	kfree(hint_req);
+}
+
+int page_hinting_prepare(void)
+{
+	hvb->hinting_arr = kmalloc_array(VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
+					 sizeof(*hvb->hinting_arr), GFP_KERNEL);
+	if (!hvb->hinting_arr)
+		return -ENOMEM;
+	return 0;
+}
+
+void hint_pages(struct list_head *pages)
+{
+	struct page *page, *next;
+	unsigned long pfn;
+	int idx = 0, order;
+
+	list_for_each_entry_safe(page, next, pages, lru) {
+		pfn = page_to_pfn(page);
+		order = page_private(page);
+		hvb->hinting_arr[idx].phys_addr = pfn << PAGE_SHIFT;
+		hvb->hinting_arr[idx].size = (1 << order) * PAGE_SIZE;
+		idx++;
+	}
+	page_hinting_report(idx);
+}
+
+void page_hinting_cleanup(void)
+{
+	kfree(hvb->hinting_arr);
+}
+
+static const struct page_hinting_cb hcb = {
+	.prepare = page_hinting_prepare,
+	.hint_pages = hint_pages,
+	.cleanup = page_hinting_cleanup,
+	.max_pages = VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
+};
+#endif
+
 static u32 page_to_balloon_pfn(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -488,6 +574,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
 	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -499,11 +586,18 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
+		callbacks[VIRTIO_BALLOON_VQ_HINTING] = NULL;
+	}
 	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 					 vqs, callbacks, names, NULL, NULL);
 	if (err)
 		return err;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
+
 	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
 	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -942,6 +1036,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		if (err)
 			goto out_del_balloon_wq;
 	}
+
+#ifdef CONFIG_PAGE_HINTING
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
+	    page_hinting_flag) {
+		hvb = vb;
+		page_hinting_enable(&hcb);
+	}
+#endif
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
@@ -989,6 +1091,12 @@ static void virtballoon_remove(struct virtio_device *vdev)
 		destroy_workqueue(vb->balloon_wq);
 	}
 
+#ifdef CONFIG_PAGE_HINTING
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+		hvb = NULL;
+		page_hinting_disable();
+	}
+#endif
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
@@ -1043,8 +1151,9 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
+	VIRTIO_BALLOON_F_HINTING,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..25e4f817c660 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -29,6 +29,7 @@
 #include <linux/virtio_types.h>
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
+#include <linux/page_hinting.h>
 
 /* The feature bitmap for virtio balloon */
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
@@ -36,6 +37,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -108,4 +110,16 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+#ifdef CONFIG_PAGE_HINTING
+/*
+ * struct hinting_data - holds the information associated with hinting.
+ * @phys_addr:	physical address of a page, or of the array holding
+ *		the isolated pages.
+ * @size:	total size associated with the phys_addr.
+ */
+struct hinting_data {
+	__virtio64 phys_addr;
+	__virtio32 size;
+};
+#endif
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.21.0



* [QEMU PATCH] KVM: Support for page hinting
  2019-06-03 17:03 [RFC][Patch v10 0/2] mm: Support for page hinting Nitesh Narayan Lal
  2019-06-03 17:03 ` [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
  2019-06-03 17:03 ` [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
@ 2019-06-03 17:04 ` Nitesh Narayan Lal
  2019-06-03 18:34   ` Alexander Duyck
  2019-06-04 16:41   ` Alexander Duyck
  2019-06-03 18:04 ` [RFC][Patch v10 0/2] mm: " Michael S. Tsirkin
  3 siblings, 2 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-03 17:04 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck

Enables QEMU to call madvise on the pages which are reported
by the guest kernel.
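
For reference, an illustrative way to exercise this on the host side (the
options shown are examples, not mandated by this patch; the guest negotiates
VIRTIO_BALLOON_F_HINTING automatically once the balloon device is present):

    qemu-system-x86_64 -enable-kvm -m 5G -device virtio-balloon [...]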

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 hw/virtio/trace-events                        |  1 +
 hw/virtio/virtio-balloon.c                    | 85 +++++++++++++++++++
 include/hw/virtio/virtio-balloon.h            |  2 +-
 include/qemu/osdep.h                          |  7 ++
 .../standard-headers/linux/virtio_balloon.h   |  1 +
 5 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 07bcbe9e85..015565785c 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -46,3 +46,4 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
 virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
 virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
 virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
+virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request: %lu size: %d"
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index a12677d4d5..cbb630279c 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -33,6 +33,13 @@
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
+struct guest_pages {
+	uint64_t phys_addr;
+	uint32_t len;
+};
+
+void page_hinting_request(uint64_t addr, uint32_t len);
+
 static void balloon_page(void *addr, int deflate)
 {
     if (!qemu_balloon_is_inhibited()) {
@@ -207,6 +214,80 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
+{
+    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
+                                                 addr, 1);
+
+    if (!mrs.mr) {
+        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx, addr);
+        return NULL;
+    }
+
+    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
+        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx " is not RAM", addr);
+        memory_region_unref(mrs.mr);
+        return NULL;
+    }
+
+    *p_mr = mrs.mr;
+    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
+}
+
+void page_hinting_request(uint64_t addr, uint32_t len)
+{
+    Error *local_err = NULL;
+    MemoryRegion *mr = NULL;
+    int ret = 0;
+    struct guest_pages *guest_obj;
+    int i = 0;
+    void *hvaddr_to_free;
+    uint64_t gpaddr_to_free;
+    void * temp_addr = gpa2hva(&mr, addr, &local_err);
+
+    if (local_err) {
+        error_report_err(local_err);
+        return;
+    }
+    guest_obj = temp_addr;
+    while (i < len) {
+	gpaddr_to_free = guest_obj[i].phys_addr;
+	trace_virtio_balloon_hinting_request(gpaddr_to_free,guest_obj[i].len);
+	hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
+	if (local_err) {
+		error_report_err(local_err);
+		return;
+	}
+	ret = qemu_madvise((void *)hvaddr_to_free, guest_obj[i].len, QEMU_MADV_FREE);
+	if (ret == -1)
+	    printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__, __func__, ret);
+	i++;
+    }
+}
+
+static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtQueueElement *elem = NULL;
+    uint64_t temp_addr;
+    uint32_t temp_len;
+    size_t size, t_size = 0;
+
+    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
+	printf("\npop error\n");
+	return;
+    }
+    size = iov_to_buf(elem->out_sg, elem->out_num, 0, &temp_addr, sizeof(temp_addr));
+    t_size += size;
+    size = iov_to_buf(elem->out_sg, elem->out_num, 8, &temp_len, sizeof(temp_len));
+    t_size += size;
+    if (!qemu_balloon_is_inhibited())
+	    page_hinting_request(temp_addr, temp_len);
+    virtqueue_push(vq, elem, t_size);
+    virtio_notify(vdev, vq);
+    g_free(elem);
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -376,6 +457,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
     VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
     f |= dev->host_features;
     virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
     return f;
 }
 
@@ -445,6 +527,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_page_hinting);
 
     reset_stats(s);
 }
@@ -488,6 +571,8 @@ static void virtio_balloon_instance_init(Object *obj)
 
     object_property_add(obj, "guest-stats", "guest statistics",
                         balloon_stats_get_all, NULL, NULL, s, NULL);
+    object_property_add(obj, "guest-page-hinting", "guest page hinting",
+                        NULL, NULL, NULL, s, NULL);
 
     object_property_add(obj, "guest-stats-polling-interval", "int",
                         balloon_stats_get_poll_interval,
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index e0df3528c8..774498a6ca 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -32,7 +32,7 @@ typedef struct virtio_balloon_stat_modern {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq;
+    VirtQueue *ivq, *dvq, *svq, *hvq;
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 840af09cb0..4d632933a9 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
 #endif
+#ifdef MADV_FREE
+#define QEMU_MADV_FREE MADV_FREE
+#else
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
 
 #endif
 
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 4dbb7dc6c0..f50c0d95ea 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.21.0



* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
  2019-06-03 17:03 [RFC][Patch v10 0/2] mm: Support for page hinting Nitesh Narayan Lal
                   ` (2 preceding siblings ...)
  2019-06-03 17:04 ` [QEMU PATCH] KVM: Support for page hinting Nitesh Narayan Lal
@ 2019-06-03 18:04 ` Michael S. Tsirkin
  2019-06-03 18:38   ` Nitesh Narayan Lal
                     ` (2 more replies)
  3 siblings, 3 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2019-06-03 18:04 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck

On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> This patch series proposes an efficient mechanism for communicating free memory
> from a guest to its hypervisor. It especially enables guests with no page cache
> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> rapidly hand back free memory to the hypervisor.
> This approach has a minimal impact on the existing core-mm infrastructure.

Could you help us compare with Alex's series?
What are the main differences?

> [...]


* Re: [QEMU PATCH] KVM: Support for page hinting
  2019-06-03 17:04 ` [QEMU PATCH] KVM: Support for page hinting Nitesh Narayan Lal
@ 2019-06-03 18:34   ` Alexander Duyck
  2019-06-03 18:37     ` Nitesh Narayan Lal
  2019-06-03 18:45     ` Nitesh Narayan Lal
  2019-06-04 16:41   ` Alexander Duyck
  1 sibling, 2 replies; 33+ messages in thread
From: Alexander Duyck @ 2019-06-03 18:34 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> Enables QEMU to call madvise on the pages which are reported
> by the guest kernel.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

What commit-id is this meant to apply on top of? I can't apply this to
the latest development version of QEMU.

> ---
>  hw/virtio/trace-events                        |  1 +
>  hw/virtio/virtio-balloon.c                    | 85 +++++++++++++++++++
>  include/hw/virtio/virtio-balloon.h            |  2 +-
>  include/qemu/osdep.h                          |  7 ++
>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>  5 files changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 07bcbe9e85..015565785c 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -46,3 +46,4 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request: %lu size: %d"
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index a12677d4d5..cbb630279c 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -33,6 +33,13 @@
>
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>
> +struct guest_pages {
> +       uint64_t phys_addr;
> +       uint32_t len;
> +};
> +

Any reason for matching up 64b addr w/ 32b size? The way I see it you
would be better off going with either 64b for both or 32b for both.
I opted for the 32b approach in my case since there was already code
in place for doing the PFN shift anyway in the standard virtio_balloon
code path.

> +void page_hinting_request(uint64_t addr, uint32_t len);
> +
>  static void balloon_page(void *addr, int deflate)
>  {
>      if (!qemu_balloon_is_inhibited()) {
> @@ -207,6 +214,80 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>      balloon_stats_change_timer(s, 0);
>  }
>
> +static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
> +{
> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
> +                                                 addr, 1);
> +
> +    if (!mrs.mr) {
> +        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx, addr);
> +        return NULL;
> +    }
> +
> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
> +        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx " is not RAM", addr);
> +        memory_region_unref(mrs.mr);
> +        return NULL;
> +    }
> +
> +    *p_mr = mrs.mr;
> +    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
> +}
> +
> +void page_hinting_request(uint64_t addr, uint32_t len)
> +{
> +    Error *local_err = NULL;
> +    MemoryRegion *mr = NULL;
> +    int ret = 0;
> +    struct guest_pages *guest_obj;
> +    int i = 0;
> +    void *hvaddr_to_free;
> +    uint64_t gpaddr_to_free;
> +    void * temp_addr = gpa2hva(&mr, addr, &local_err);
> +
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return;
> +    }
> +    guest_obj = temp_addr;
> +    while (i < len) {
> +       gpaddr_to_free = guest_obj[i].phys_addr;
> +       trace_virtio_balloon_hinting_request(gpaddr_to_free,guest_obj[i].len);
> +       hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
> +       if (local_err) {
> +               error_report_err(local_err);
> +               return;
> +       }
> +       ret = qemu_madvise((void *)hvaddr_to_free, guest_obj[i].len, QEMU_MADV_FREE);
> +       if (ret == -1)
> +           printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__, __func__, ret);
> +       i++;
> +    }
> +}
> +

Have we made any determination yet on MADV_FREE vs MADV_DONTNEED?
My preference would be to have this code just reuse the existing
balloon code as I did in my patch set. Then we can avoid the need to
have multiple types in use. We could just have the balloon use the
same as the hint.

> +static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    VirtQueueElement *elem = NULL;
> +    uint64_t temp_addr;
> +    uint32_t temp_len;
> +    size_t size, t_size = 0;
> +
> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +    if (!elem) {
> +       printf("\npop error\n");
> +       return;
> +    }
> +    size = iov_to_buf(elem->out_sg, elem->out_num, 0, &temp_addr, sizeof(temp_addr));
> +    t_size += size;
> +    size = iov_to_buf(elem->out_sg, elem->out_num, 8, &temp_len, sizeof(temp_len));
> +    t_size += size;
> +    if (!qemu_balloon_is_inhibited())
> +           page_hinting_request(temp_addr, temp_len);
> +    virtqueue_push(vq, elem, t_size);
> +    virtio_notify(vdev, vq);
> +    g_free(elem);
> +}
> +

If you are doing a u64 addr, and a u32 len, does that mean you are
having to use a packed array between the guest and the host? This
would be another good reason to have both settle on either u64 or u32.

> [...]
> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> index 4dbb7dc6c0..f50c0d95ea 100644
> --- a/include/standard-headers/linux/virtio_balloon.h
> +++ b/include/standard-headers/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST        0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ      1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */

So this is obviously built against an old version of QEMU, the latest
values for this include:
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON    4 /* Guest is using page poisoning */

I wonder if we shouldn't look for a term other than "HINT" since there
is already the code around providing hints to migration.


* Re: [QEMU PATCH] KVM: Support for page hinting
  2019-06-03 18:34   ` Alexander Duyck
@ 2019-06-03 18:37     ` Nitesh Narayan Lal
  2019-06-03 18:45     ` Nitesh Narayan Lal
  1 sibling, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-03 18:37 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli




On 6/3/19 2:34 PM, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> Enables QEMU to call madvise on the pages which are reported
>> by the guest kernel.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> What commit-id is this meant to apply on top of? I can't apply this to
> the latest development version of QEMU.
I am not at the latest commit with this patch.
The top commit of my tree is: f3b4d5ca67f2e933c93457b701883c307b99c15c
>
>> ---
>>  hw/virtio/trace-events                        |  1 +
>>  hw/virtio/virtio-balloon.c                    | 85 +++++++++++++++++++
>>  include/hw/virtio/virtio-balloon.h            |  2 +-
>>  include/qemu/osdep.h                          |  7 ++
>>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>>  5 files changed, 95 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>> index 07bcbe9e85..015565785c 100644
>> --- a/hw/virtio/trace-events
>> +++ b/hw/virtio/trace-events
>> @@ -46,3 +46,4 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
>> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request: %lu size: %d"
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index a12677d4d5..cbb630279c 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -33,6 +33,13 @@
>>
>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>
>> +struct guest_pages {
>> +       uint64_t phys_addr;
>> +       uint32_t len;
>> +};
>> +
> Any reason for matching up 64b addr w/ 32b size? The way I see it you
> would be better off going with either 64b for both or 32b for both.
> I opted for the 32b approach in my case since there was already code
> in place for doing the PFN shift anyway in the standard virtio_balloon
> code path.
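
[As an illustration, matched widths would avoid the mixed-size entry
entirely; this is only a sketch, with a hypothetical struct name, not
part of the posted patch:

struct guest_page_range {
	uint64_t phys_addr;	/* guest physical address of the run */
	uint64_t len;		/* length of the run in bytes */
};

/* or both fields as 32 bit, reusing the existing balloon PFN shift:
 *	uint32_t pfn;		gpa >> VIRTIO_BALLOON_PFN_SHIFT
 *	uint32_t nr_pages;	run length in balloon pages
 */
]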
>
>> +void page_hinting_request(uint64_t addr, uint32_t len);
>> +
>>  static void balloon_page(void *addr, int deflate)
>>  {
>>      if (!qemu_balloon_is_inhibited()) {
>> @@ -207,6 +214,80 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>>      balloon_stats_change_timer(s, 0);
>>  }
>>
>> +static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
>> +{
>> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
>> +                                                 addr, 1);
>> +
>> +    if (!mrs.mr) {
>> +        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx, addr);
>> +        return NULL;
>> +    }
>> +
>> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
>> +        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx "is not RAM", addr);
>> +        memory_region_unref(mrs.mr);
>> +        return NULL;
>> +    }
>> +
>> +    *p_mr = mrs.mr;
>> +    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
>> +}
>> +
>> +void page_hinting_request(uint64_t addr, uint32_t len)
>> +{
>> +    Error *local_err = NULL;
>> +    MemoryRegion *mr = NULL;
>> +    int ret = 0;
>> +    struct guest_pages *guest_obj;
>> +    int i = 0;
>> +    void *hvaddr_to_free;
>> +    uint64_t gpaddr_to_free;
>> +    void * temp_addr = gpa2hva(&mr, addr, &local_err);
>> +
>> +    if (local_err) {
>> +        error_report_err(local_err);
>> +        return;
>> +    }
>> +    guest_obj = temp_addr;
>> +    while (i < len) {
>> +       gpaddr_to_free = guest_obj[i].phys_addr;
>> +       trace_virtio_balloon_hinting_request(gpaddr_to_free,guest_obj[i].len);
>> +       hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
>> +       if (local_err) {
>> +               error_report_err(local_err);
>> +               return;
>> +       }
>> +       ret = qemu_madvise((void *)hvaddr_to_free, guest_obj[i].len, QEMU_MADV_FREE);
>> +       if (ret == -1)
>> +           printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__, __func__, ret);
>> +       i++;
>> +    }
>> +}
>> +
> Have we made any determination yet on the MADV_FREE vs MADV_DONTNEED?
> My preference would be to have this code just reuse the existing
> balloon code as I did in my patch set. Then we can avoid the need to
> have multiple types in use. We could just have the balloon use the
> same as the hint.
>
>> +static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
>> +{
>> +    VirtQueueElement *elem = NULL;
>> +    uint64_t temp_addr;
>> +    uint32_t temp_len;
>> +    size_t size, t_size = 0;
>> +
>> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>> +    if (!elem) {
>> +       printf("\npop error\n");
>> +       return;
>> +    }
>> +    size = iov_to_buf(elem->out_sg, elem->out_num, 0, &temp_addr, sizeof(temp_addr));
>> +    t_size += size;
>> +    size = iov_to_buf(elem->out_sg, elem->out_num, 8, &temp_len, sizeof(temp_len));
>> +    t_size += size;
>> +    if (!qemu_balloon_is_inhibited())
>> +           page_hinting_request(temp_addr, temp_len);
>> +    virtqueue_push(vq, elem, t_size);
>> +    virtio_notify(vdev, vq);
>> +    g_free(elem);
>> +}
>> +
> If you are doing a u64 addr, and a u32 len, does that mean you are
> having to use a packed array between the guest and the host? This
> would be another good reason to have both settle on either u64 or u32.
>
>>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>  {
>>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
>> @@ -376,6 +457,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>>      f |= dev->host_features;
>>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
>> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
>>      return f;
>>  }
>>
>> @@ -445,6 +527,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_page_hinting);
>>
>>      reset_stats(s);
>>  }
>> @@ -488,6 +571,8 @@ static void virtio_balloon_instance_init(Object *obj)
>>
>>      object_property_add(obj, "guest-stats", "guest statistics",
>>                          balloon_stats_get_all, NULL, NULL, s, NULL);
>> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
>> +                        NULL, NULL, NULL, s, NULL);
>>
>>      object_property_add(obj, "guest-stats-polling-interval", "int",
>>                          balloon_stats_get_poll_interval,
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index e0df3528c8..774498a6ca 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -32,7 +32,7 @@ typedef struct virtio_balloon_stat_modern {
>>
>>  typedef struct VirtIOBalloon {
>>      VirtIODevice parent_obj;
>> -    VirtQueue *ivq, *dvq, *svq;
>> +    VirtQueue *ivq, *dvq, *svq, *hvq;
>>      uint32_t num_pages;
>>      uint32_t actual;
>>      uint64_t stats[VIRTIO_BALLOON_S_NR];
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index 840af09cb0..4d632933a9 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #else
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>>  #endif
>> +#ifdef MADV_FREE
>> +#define QEMU_MADV_FREE MADV_FREE
>> +#else
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>> +#endif
>>
>>  #elif defined(CONFIG_POSIX_MADVISE)
>>
>> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>
>>  #else /* no-op */
>>
>> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>
>>  #endif
>>
>> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
>> index 4dbb7dc6c0..f50c0d95ea 100644
>> --- a/include/standard-headers/linux/virtio_balloon.h
>> +++ b/include/standard-headers/linux/virtio_balloon.h
>> @@ -34,6 +34,7 @@
>>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST        0 /* Tell before reclaiming pages */
>>  #define VIRTIO_BALLOON_F_STATS_VQ      1 /* Memory Stats virtqueue */
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
> So this is obviously built against an old version of QEMU, the latest
> values for this include:
> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> #define VIRTIO_BALLOON_F_PAGE_POISON    4 /* Guest is using page poisoning */
>
> I wonder if we shouldn't look for a term other than "HINT" since there
> is already the code around providing hints to migration.
-- 
Regards
Nitesh



* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
  2019-06-03 18:04 ` [RFC][Patch v10 0/2] mm: " Michael S. Tsirkin
@ 2019-06-03 18:38   ` Nitesh Narayan Lal
  2019-06-11 12:19   ` Nitesh Narayan Lal
  2019-06-25 14:48   ` Nitesh Narayan Lal
  2 siblings, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-03 18:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck


On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?
I have just started reviewing Alex's series. Once I am done with it, I can.
>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests could be launched, each of 5 GB (total
>> 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
>> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
>> in the last invocation to only need 37s compared to 3m35s without page hinting.
>>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor processed
>> the hinting request, the isolated pages are returned back to the buddy.
>>
>> The key changes made in this series compared to v9[1] are:
>> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to
>> not break up the THP.
>> * At a time only a set of 16 pages can be isolated and reported to the host to
>> avoid any false OOMs.
>> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
>> on virtio and not on KVM itself. This would enable any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device assigned guests to ensure that hinting doesn't break it.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, possible races when
>> disabling.
>> * Come up with proper/traceable error-message/logs.
>> * Minor reworks and simplifications (e.g., virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via memset
>> and exits.
>> Procedure =
>> The first guest is launched and once its console is up, the test allocation
>> program is executed with 4 GB memory request (Due to this the guest occupies
>> almost 4-5 GB of memory in the host in a system without page hinting). Once
>> this program exits, another guest is launched in the host and the
>> same process is followed. This is continued until swap comes into use.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>> one after the other in each of them.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
>>
>> Performance analysis:
>> 1. will-it-scale's page_faul1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413
>>
-- 
Regards
Nitesh



* Re: [QEMU PATCH] KVM: Support for page hinting
  2019-06-03 18:34   ` Alexander Duyck
  2019-06-03 18:37     ` Nitesh Narayan Lal
@ 2019-06-03 18:45     ` Nitesh Narayan Lal
  1 sibling, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-03 18:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli


[-- Attachment #1.1: Type: text/plain, Size: 10053 bytes --]


On 6/3/19 2:34 PM, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> Enables QEMU to call madvise on the pages which are reported
>> by the guest kernel.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> What commit-id is this meant to apply on top of? I can't apply this to
> the latest development version of QEMU.
>
>> ---
>>  hw/virtio/trace-events                        |  1 +
>>  hw/virtio/virtio-balloon.c                    | 85 +++++++++++++++++++
>>  include/hw/virtio/virtio-balloon.h            |  2 +-
>>  include/qemu/osdep.h                          |  7 ++
>>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>>  5 files changed, 95 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>> index 07bcbe9e85..015565785c 100644
>> --- a/hw/virtio/trace-events
>> +++ b/hw/virtio/trace-events
>> @@ -46,3 +46,4 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
>> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request: %lu size: %d"
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index a12677d4d5..cbb630279c 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -33,6 +33,13 @@
>>
>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>
>> +struct guest_pages {
>> +       uint64_t phys_addr;
>> +       uint32_t len;
>> +};
>> +
> Any reason for matching up 64b addr w/ 32b size? The way I see it you
> would be better off going with either 64b for both or 32b for both.
> I opted for the 32b approach in my case since there was already code
> in place for doing the PFN shift anyway in the standard virtio_balloon
> code path.
>
>> +void page_hinting_request(uint64_t addr, uint32_t len);
>> +
>>  static void balloon_page(void *addr, int deflate)
>>  {
>>      if (!qemu_balloon_is_inhibited()) {
>> @@ -207,6 +214,80 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>>      balloon_stats_change_timer(s, 0);
>>  }
>>
>> +static void *gpa2hva(MemoryRegion **p_mr, hwaddr addr, Error **errp)
>> +{
>> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
>> +                                                 addr, 1);
>> +
>> +    if (!mrs.mr) {
>> +        error_setg(errp, "No memory is mapped at address 0x%" HWADDR_PRIx, addr);
>> +        return NULL;
>> +    }
>> +
>> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
>> +        error_setg(errp, "Memory at address 0x%" HWADDR_PRIx "is not RAM", addr);
>> +        memory_region_unref(mrs.mr);
>> +        return NULL;
>> +    }
>> +
>> +    *p_mr = mrs.mr;
>> +    return qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
>> +}
>> +
>> +void page_hinting_request(uint64_t addr, uint32_t len)
>> +{
>> +    Error *local_err = NULL;
>> +    MemoryRegion *mr = NULL;
>> +    int ret = 0;
>> +    struct guest_pages *guest_obj;
>> +    int i = 0;
>> +    void *hvaddr_to_free;
>> +    uint64_t gpaddr_to_free;
>> +    void * temp_addr = gpa2hva(&mr, addr, &local_err);
>> +
>> +    if (local_err) {
>> +        error_report_err(local_err);
>> +        return;
>> +    }
>> +    guest_obj = temp_addr;
>> +    while (i < len) {
>> +       gpaddr_to_free = guest_obj[i].phys_addr;
>> +       trace_virtio_balloon_hinting_request(gpaddr_to_free,guest_obj[i].len);
>> +       hvaddr_to_free = gpa2hva(&mr, gpaddr_to_free, &local_err);
>> +       if (local_err) {
>> +               error_report_err(local_err);
>> +               return;
>> +       }
>> +       ret = qemu_madvise((void *)hvaddr_to_free, guest_obj[i].len, QEMU_MADV_FREE);
>> +       if (ret == -1)
>> +           printf("\n%d:%s Error: Madvise failed with error:%d\n", __LINE__, __func__, ret);
>> +       i++;
>> +    }
>> +}
>> +
> Have we made any determination yet on the MADV_FREE vs MADV_DONTNEED?
One of the reasons was mentioned by Andrea last time, but I don't have
any stats yet to prove that one is better than the other. It is on my
todo list.
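
[For reference, the trade-off under discussion follows from the
madvise(2) semantics; the calls below are a sketch reusing the patch's
own variables, not posted code:

/* eager: backing memory is dropped immediately; the next guest access
 * faults in a fresh zero page, so the host reclaims right away */
ret = qemu_madvise(hvaddr_to_free, guest_obj[i].len, QEMU_MADV_DONTNEED);

/* lazy: pages are reclaimed only under host memory pressure; cheaper
 * if the guest re-touches the range soon, but the host may not see
 * the memory back for a while */
ret = qemu_madvise(hvaddr_to_free, guest_obj[i].len, QEMU_MADV_FREE);
]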
> My preference would be to have this code just reuse the existing
> balloon code as I did in my patch set. Then we can avoid the need to
> have multiple types in use. We could just have the balloon use the
> same as the hint.
>
>> +static void virtio_balloon_page_hinting(VirtIODevice *vdev, VirtQueue *vq)
>> +{
>> +    VirtQueueElement *elem = NULL;
>> +    uint64_t temp_addr;
>> +    uint32_t temp_len;
>> +    size_t size, t_size = 0;
>> +
>> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>> +    if (!elem) {
>> +       printf("\npop error\n");
>> +       return;
>> +    }
>> +    size = iov_to_buf(elem->out_sg, elem->out_num, 0, &temp_addr, sizeof(temp_addr));
>> +    t_size += size;
>> +    size = iov_to_buf(elem->out_sg, elem->out_num, 8, &temp_len, sizeof(temp_len));
>> +    t_size += size;
>> +    if (!qemu_balloon_is_inhibited())
>> +           page_hinting_request(temp_addr, temp_len);
>> +    virtqueue_push(vq, elem, t_size);
>> +    virtio_notify(vdev, vq);
>> +    g_free(elem);
>> +}
>> +
> If you are doing a u64 addr, and a u32 len, does that mean you are
> having to use a packed array between the guest and the host? This
> would be another good reason to have both settle on either u64 or u32.
I will take a look at this, thanks.
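
[To make the layout concern concrete: with a 64-bit address and a
32-bit length, sizeof(struct guest_pages) is 16 on x86-64 (4 bytes of
tail padding), while a packed guest-side build yields 12-byte entries,
so the host's guest_obj[i] indexing only matches the guest if both
sides agree on the stride. One way to pin the layout down, sketched
here:

struct guest_pages {
	uint64_t phys_addr;
	uint32_t len;
} QEMU_PACKED;	/* 12-byte stride on both sides, no hidden padding */

/* alternatively, widen len to uint64_t so the natural layout is
 * already padding-free and 64-bit aligned */
]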
>
>>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>  {
>>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
>> @@ -376,6 +457,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>>      f |= dev->host_features;
>>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
>> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
>>      return f;
>>  }
>>
>> @@ -445,6 +527,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_page_hinting);
>>
>>      reset_stats(s);
>>  }
>> @@ -488,6 +571,8 @@ static void virtio_balloon_instance_init(Object *obj)
>>
>>      object_property_add(obj, "guest-stats", "guest statistics",
>>                          balloon_stats_get_all, NULL, NULL, s, NULL);
>> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
>> +                        NULL, NULL, NULL, s, NULL);
>>
>>      object_property_add(obj, "guest-stats-polling-interval", "int",
>>                          balloon_stats_get_poll_interval,
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index e0df3528c8..774498a6ca 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -32,7 +32,7 @@ typedef struct virtio_balloon_stat_modern {
>>
>>  typedef struct VirtIOBalloon {
>>      VirtIODevice parent_obj;
>> -    VirtQueue *ivq, *dvq, *svq;
>> +    VirtQueue *ivq, *dvq, *svq, *hvq;
>>      uint32_t num_pages;
>>      uint32_t actual;
>>      uint64_t stats[VIRTIO_BALLOON_S_NR];
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index 840af09cb0..4d632933a9 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #else
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>>  #endif
>> +#ifdef MADV_FREE
>> +#define QEMU_MADV_FREE MADV_FREE
>> +#else
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>> +#endif
>>
>>  #elif defined(CONFIG_POSIX_MADVISE)
>>
>> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>
>>  #else /* no-op */
>>
>> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>
>>  #endif
>>
>> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
>> index 4dbb7dc6c0..f50c0d95ea 100644
>> --- a/include/standard-headers/linux/virtio_balloon.h
>> +++ b/include/standard-headers/linux/virtio_balloon.h
>> @@ -34,6 +34,7 @@
>>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST        0 /* Tell before reclaiming pages */
>>  #define VIRTIO_BALLOON_F_STATS_VQ      1 /* Memory Stats virtqueue */
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
> So this is obviously built against an old version of QEMU, the latest
> values for this include:
> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> #define VIRTIO_BALLOON_F_PAGE_POISON    4 /* Guest is using page poisoning */
>
> I wonder if we shouldn't look for a term other than "HINT" 
We may have to come up with a different/better naming convention to
avoid any confusion.
> since there
> is already the code around providing hints to migration.
I will rebase on top of the latest tree next time around.
-- 
Regards
Nitesh



* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-03 17:03 ` [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
@ 2019-06-03 19:04   ` Alexander Duyck
  2019-06-04 12:55     ` Nitesh Narayan Lal
  2019-06-03 19:57   ` David Hildenbrand
  2019-06-14  7:24   ` David Hildenbrand
  2 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2019-06-03 19:04 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> This patch introduces the core infrastructure for free page hinting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
>
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_FREE), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend (virtio-balloon
> in our case).
>
> To efficiently identify free pages that can be hinted to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
>
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
>
> There are still various things to look into (e.g., memory hotplug, more
> efficient locking, possible races when disabling).
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

So one thing I had thought about, which I don't believe has been
addressed in your solution, is a means to guarantee forward progress.
If you have a noisy thread that is allocating and freeing some block
of memory repeatedly, you will be stuck processing that and cannot get
to the other work. Specifically, if you have a zone where somebody is
just cycling the number of pages needed to fill your hinting queue,
how do you get around it and get to the data that is actually cold,
instead of getting stuck processing the noise?

Do you have any idea what the hit rate would be on a system that is on
the more active side? From what I can tell, you are still effectively
doing a linear search of memory; the bitmap hints tell you what has
been freed recently, but you still don't know that the pages you have
bitmap hints for are actually free until you check them.

> ---
>  drivers/virtio/Kconfig       |   1 +
>  include/linux/page_hinting.h |  46 +++++++
>  mm/Kconfig                   |   6 +
>  mm/Makefile                  |   2 +
>  mm/page_alloc.c              |  17 +--
>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>  6 files changed, 301 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..5a96b7a2ed1e 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>         tristate "Virtio balloon driver"
>         depends on VIRTIO
>         select MEMORY_BALLOON
> +       select PAGE_HINTING
>         ---help---
>          This driver supports increasing and decreasing the amount
>          of memory within a KVM guest.
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> new file mode 100644
> index 000000000000..e65188fe1e6b
> --- /dev/null
> +++ b/include/linux/page_hinting.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_HINTING_H
> +#define _LINUX_PAGE_HINTING_H
> +
> +/*
> + * Minimum page order required for a page to be hinted to the host.
> + */
> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
> +
> +/*
> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
> + * isolated pages.
> + * @prepare:           Callback responsible for allocating an array to hold
> + *                     the isolated pages.
> + * @hint_pages:                Callback which reports the isolated pages synchronously
> + *                     to the host.
> + * @cleanup:           Callback to free the array used for reporting the
> + *                     isolated pages.
> + * @max_pages:         Maximum pages that are going to be hinted to the host
> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
> + */
> +struct page_hinting_cb {
> +       int (*prepare)(void);
> +       void (*hint_pages)(struct list_head *list);
> +       void (*cleanup)(void);
> +       int max_pages;
> +};
> +
> +#ifdef CONFIG_PAGE_HINTING
> +void page_hinting_enqueue(struct page *page, int order);
> +void page_hinting_enable(const struct page_hinting_cb *cb);
> +void page_hinting_disable(void);
> +#else
> +static inline void page_hinting_enqueue(struct page *page, int order)
> +{
> +}
> +
> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
> +{
> +}
> +
> +static inline void page_hinting_disable(void)
> +{
> +}
> +#endif
> +#endif /* _LINUX_PAGE_HINTING_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ee8d1f311858..177d858de758 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>  config ARCH_HAS_PTE_SPECIAL
>         bool
>
> +# PAGE_HINTING will allow the guest to report the free pages to the
> +# host in regular interval of time.
> +config PAGE_HINTING
> +       bool
> +       def_bool n
> +       depends on X86_64
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index ac5e5ba78874..bec456dfee34 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>                            interval_tree.o list_lru.o workingset.o \
>                            debug.o $(mmu-y)
>
> +
>  # Give 'page_alloc' its own module-parameter namespace
>  page-alloc-y := page_alloc.o
>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)      += cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b13d3914176..d12f69e0e402 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/page_hinting.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>   * -- nyc
>   */
>
> -static inline void __free_one_page(struct page *page,
> +inline void __free_one_page(struct page *page,
>                 unsigned long pfn,
>                 struct zone *zone, unsigned int order,
> -               int migratetype)
> +               int migratetype, bool hint)
>  {
>         unsigned long combined_pfn;
>         unsigned long uninitialized_var(buddy_pfn);
> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>         set_page_order(page, order);
>
> +       if (hint)
> +               page_hinting_enqueue(page, order);

This is probably a bit early to be dealing with the hint. You should
look at moving this down to a spot after the page has been added to the
free list. It may not cause any issues with the current ordering, but
hinting after the addition to the free list guarantees that the page is
actually on the list when you call this function.
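
[A rough sketch of the suggested ordering; helper names follow the
v5.2-era tree this series appears to be based on, and this is not the
posted code:

done_merging:
	set_page_order(page, order);
	/* ... existing tail/random placement logic ... */
	add_to_free_area(page, &zone->free_area[order], migratetype);

	/* enqueue the hint only once the page actually sits on the
	 * free list */
	if (hint)
		page_hinting_enqueue(page, order);
]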

>         /*
>          * If this is not the largest possible page, check if the buddy
>          * of the next-highest order is free. If it is, it's possible
> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>                 if (unlikely(isolated_pageblocks))
>                         mt = get_pageblock_migratetype(page);
>
> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>                 trace_mm_page_pcpu_drain(page, 0, mt);
>         }
>         spin_unlock(&zone->lock);
> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>                                 struct page *page, unsigned long pfn,
>                                 unsigned int order,
> -                               int migratetype)
> +                               int migratetype, bool hint)
>  {
>         spin_lock(&zone->lock);
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
>                 migratetype = get_pfnblock_migratetype(page, pfn);
>         }
> -       __free_one_page(page, pfn, zone, order, migratetype);
> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>         spin_unlock(&zone->lock);
>  }
>
> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>         migratetype = get_pfnblock_migratetype(page, pfn);
>         local_irq_save(flags);
>         __count_vm_events(PGFREE, 1 << order);
> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>         local_irq_restore(flags);
>  }
>
> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>          */
>         if (migratetype >= MIGRATE_PCPTYPES) {
>                 if (unlikely(is_migrate_isolate(migratetype))) {
> -                       free_one_page(zone, page, pfn, 0, migratetype);
> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>                         return;
>                 }
>                 migratetype = MIGRATE_MOVABLE;

So it looks like you are using a parameter to identify whether the page
is a hinted page or not. I guess this works, but it seems a bit
intrusive, as you are adding an argument just to flag a specific page
type.

> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> new file mode 100644
> index 000000000000..7341c6462de2
> --- /dev/null
> +++ b/mm/page_hinting.c
> @@ -0,0 +1,236 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Page hinting support to enable a VM to report the freed pages back
> + * to the host.
> + *
> + * Copyright Red Hat, Inc. 2019
> + *
> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/page_hinting.h>
> +#include <linux/kvm_host.h>
> +
> +/*
> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
> + * and other required parameters which could help in retrieving the original
> + * PFN value using the bitmap.
> + * @bitmap:            Pointer to the bitmap of free PFN.
> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
> + * @free_pages:                Tracks the number of free pages of granularity
> + *                     PAGE_HINTING_MIN_ORDER.
> + * @nbits:             Indicates the total size of the bitmap in bits allocated
> + *                     at the time of initialization.
> + */
> +struct hinting_bitmap {
> +       unsigned long *bitmap;
> +       unsigned long base_pfn;
> +       atomic_t free_pages;
> +       unsigned long nbits;
> +} bm_zone[MAX_NR_ZONES];
> +

This ignores NUMA, doesn't it? Shouldn't you have support for other NUMA nodes?
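
[In other words, with more than one pgdat the per-zone-type array
aliases nodes. A hedged sketch of a NUMA-aware layout; page_to_bm is a
hypothetical helper:

/* one bitmap per (node, zone) pair rather than per zone type */
static struct hinting_bitmap bm_zone[MAX_NUMNODES][MAX_NR_ZONES];

static struct hinting_bitmap *page_to_bm(struct page *page)
{
	return &bm_zone[page_to_nid(page)][page_zonenum(page)];
}
]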

> +static void init_hinting_wq(struct work_struct *work);
> +extern int __isolate_free_page(struct page *page, unsigned int order);
> +extern void __free_one_page(struct page *page, unsigned long pfn,
> +                           struct zone *zone, unsigned int order,
> +                           int migratetype, bool hint);
> +const struct page_hinting_cb *hcb;
> +struct work_struct hinting_work;
> +
> +static unsigned long find_bitmap_size(struct zone *zone)
> +{
> +       unsigned long nbits = ALIGN(zone->spanned_pages,
> +                           PAGE_HINTING_MIN_ORDER);
> +
> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
> +       return nbits;
> +}
> +

This doesn't look right to me. You are trying to do something like a
DIV_ROUND_UP here, right? If so, shouldn't you be aligning to 1 <<
PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
Another option would be to just do DIV_ROUND_UP with the 1 <<
PAGE_HINTING_MIN_ORDER value.
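
[For illustration, the DIV_ROUND_UP variant would read as follows; this
is a sketch of the suggestion, not posted code:

static unsigned long find_bitmap_size(struct zone *zone)
{
	/* one bit per MAX_ORDER - 2 sized chunk, span rounded up */
	return DIV_ROUND_UP(zone->spanned_pages,
			    1UL << PAGE_HINTING_MIN_ORDER);
}
]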

> +void page_hinting_enable(const struct page_hinting_cb *callback)
> +{
> +       struct zone *zone;
> +       int idx = 0;
> +       unsigned long bitmap_size = 0;
> +
> +       for_each_populated_zone(zone) {

The index for this doesn't match up to the index you used to define
bm_zone. for_each_populated_zone will go through each zone in each
pgdat. Right now you can only handle one pgdat.
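
[One hedged way to keep the index consistent would be to derive it from
the zone itself rather than a running counter, with bm_zone sized
MAX_NUMNODES * MAX_NR_ZONES to match; sketch only:

for_each_populated_zone(zone) {
	/* stable index even with multiple pgdats */
	int idx = zone->zone_pgdat->node_id * MAX_NR_ZONES +
		  zone_idx(zone);

	/* ... bitmap allocation as before, into bm_zone[idx] ... */
}
]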

> +               spin_lock(&zone->lock);
> +               bitmap_size = find_bitmap_size(zone);
> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> +               if (!bm_zone[idx].bitmap)
> +                       return;
> +               bm_zone[idx].nbits = bitmap_size;
> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
> +               spin_unlock(&zone->lock);
> +               idx++;
> +       }
> +       hcb = callback;
> +       INIT_WORK(&hinting_work, init_hinting_wq);
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_enable);
> +
> +void page_hinting_disable(void)
> +{
> +       struct zone *zone;
> +       int idx = 0;
> +
> +       cancel_work_sync(&hinting_work);
> +       hcb = NULL;
> +       for_each_populated_zone(zone) {
> +               spin_lock(&zone->lock);
> +               bitmap_free(bm_zone[idx].bitmap);
> +               bm_zone[idx].base_pfn = 0;
> +               bm_zone[idx].nbits = 0;
> +               atomic_set(&bm_zone[idx].free_pages, 0);
> +               spin_unlock(&zone->lock);
> +               idx++;
> +       }
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_disable);
> +
> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
> +{
> +       unsigned long bitnr;
> +
> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
> +                        >> PAGE_HINTING_MIN_ORDER;
> +       return bitnr;
> +}
> +
> +static void release_buddy_pages(struct list_head *pages)
> +{
> +       int mt = 0, zonenum, order;
> +       struct page *page, *next;
> +       struct zone *zone;
> +       unsigned long bitnr;
> +
> +       list_for_each_entry_safe(page, next, pages, lru) {
> +               zonenum = page_zonenum(page);
> +               zone = page_zone(page);
> +               bitnr = pfn_to_bit(page, zonenum);
> +               spin_lock(&zone->lock);
> +               list_del(&page->lru);
> +               order = page_private(page);
> +               set_page_private(page, 0);
> +               mt = get_pageblock_migratetype(page);
> +               __free_one_page(page, page_to_pfn(page), zone,
> +                               order, mt, false);
> +               spin_unlock(&zone->lock);
> +       }
> +}
> +
> +static void bm_set_pfn(struct page *page)
> +{
> +       unsigned long bitnr = 0;
> +       int zonenum = page_zonenum(page);
> +       struct zone *zone = page_zone(page);
> +
> +       lockdep_assert_held(&zone->lock);
> +       bitnr = pfn_to_bit(page, zonenum);
> +       if (bm_zone[zonenum].bitmap &&
> +           bitnr < bm_zone[zonenum].nbits &&
> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
> +               atomic_inc(&bm_zone[zonenum].free_pages);
> +}
> +
> +static void scan_hinting_bitmap(int zonenum, int free_pages)
> +{
> +       unsigned long set_bit, start = 0;
> +       struct page *page;
> +       struct zone *zone;
> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
> +       LIST_HEAD(isolated_pages);
> +
> +       ret = hcb->prepare();
> +       if (ret < 0)
> +               return;
> +       for (;;) {
> +               ret = 0;
> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
> +                                       bm_zone[zonenum].nbits, start);
> +               if (set_bit >= bm_zone[zonenum].nbits)
> +                       break;
> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> +                               bm_zone[zonenum].base_pfn);
> +               if (!page)
> +                       continue;
> +               zone = page_zone(page);
> +               spin_lock(&zone->lock);
> +
> +               if (PageBuddy(page) && page_private(page) >=
> +                   PAGE_HINTING_MIN_ORDER) {
> +                       order = page_private(page);
> +                       ret = __isolate_free_page(page, order);
> +               }
> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
> +               spin_unlock(&zone->lock);
> +               if (ret) {
> +                       /*
> +                        * restoring page order to use it while releasing
> +                        * the pages back to the buddy.
> +                        */
> +                       set_page_private(page, order);
> +                       list_add_tail(&page->lru, &isolated_pages);
> +                       isolated_cnt++;
> +                       if (isolated_cnt == hcb->max_pages) {
> +                               hcb->hint_pages(&isolated_pages);
> +                               release_buddy_pages(&isolated_pages);
> +                               isolated_cnt = 0;
> +                       }
> +               }
> +               start = set_bit + 1;
> +               scanned_pages++;
> +       }
> +       if (isolated_cnt) {
> +               hcb->hint_pages(&isolated_pages);
> +               release_buddy_pages(&isolated_pages);
> +       }
> +       hcb->cleanup();
> +       if (scanned_pages > free_pages)
> +               atomic_sub((scanned_pages - free_pages),
> +                          &bm_zone[zonenum].free_pages);
> +}
> +
> +static bool check_hinting_threshold(void)
> +{
> +       int zonenum = 0;
> +
> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
> +                               hcb->max_pages)
> +                       return true;
> +       }
> +       return false;
> +}
> +
> +static void init_hinting_wq(struct work_struct *work)
> +{
> +       int zonenum = 0, free_pages = 0;
> +
> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
> +               if (free_pages >= hcb->max_pages) {
> +                       /* Find a better way to synchronize per zone
> +                        * free_pages.
> +                        */
> +                       atomic_sub(free_pages,
> +                                  &bm_zone[zonenum].free_pages);
> +                       scan_hinting_bitmap(zonenum, free_pages);
> +               }
> +       }
> +}
> +
> +void page_hinting_enqueue(struct page *page, int order)
> +{
> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
> +               bm_set_pfn(page);
> +       else
> +               return;

You could probably flip the logic and save yourself an "else" by just
doing something like:
if (!hcb || order < PAGE_HINTING_MIN_ORDER)
        return;

I think it would also make this more readable.

> +
> +       if (check_hinting_threshold()) {
> +               int cpu = smp_processor_id();
> +
> +               queue_work_on(cpu, system_wq, &hinting_work);
> +       }
> +}
> --
> 2.21.0
>


* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-03 17:03 ` [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
  2019-06-03 19:04   ` Alexander Duyck
@ 2019-06-03 19:57   ` David Hildenbrand
  2019-06-04 13:16     ` Nitesh Narayan Lal
  2019-06-14  7:24   ` David Hildenbrand
  2 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2019-06-03 19:57 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, mst,
	dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck

On 03.06.19 19:03, Nitesh Narayan Lal wrote:
> This patch introduces the core infrastructure for free page hinting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
> 
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_FREE), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend (virtio-balloon
> in our case).
> 
> To efficiently identify free pages that can be hinted to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
> 
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
> 
> There are still various things to look into (e.g., memory hotplug, more
> efficient locking, possible races when disabling).
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/Kconfig       |   1 +
>  include/linux/page_hinting.h |  46 +++++++
>  mm/Kconfig                   |   6 +
>  mm/Makefile                  |   2 +
>  mm/page_alloc.c              |  17 +--
>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>  6 files changed, 301 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..5a96b7a2ed1e 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
>  	select MEMORY_BALLOON
> +	select PAGE_HINTING
>  	---help---
>  	 This driver supports increasing and decreasing the amount
>  	 of memory within a KVM guest.
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> new file mode 100644
> index 000000000000..e65188fe1e6b
> --- /dev/null
> +++ b/include/linux/page_hinting.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_HINTING_H
> +#define _LINUX_PAGE_HINTING_H
> +
> +/*
> + * Minimum page order required for a page to be hinted to the host.
> + */
> +#define PAGE_HINTING_MIN_ORDER		(MAX_ORDER - 2)
> +
> +/*
> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
> + * isolated pages.
> + * @prepare:		Callback responsible for allocating an array to hold
> + *			the isolated pages.
> + * @hint_pages:		Callback which reports the isolated pages synchronously
> + *			to the host.
> + * @cleanup:		Callback to free the array used for reporting the
> + *			isolated pages.
> + * @max_pages:		Maximum pages that are going to be hinted to the host
> + *			at a time of granularity >= PAGE_HINTING_MIN_ORDER.
> + */
> +struct page_hinting_cb {
> +	int (*prepare)(void);
> +	void (*hint_pages)(struct list_head *list);
> +	void (*cleanup)(void);
> +	int max_pages;

If we allocate the array in virtio-balloon differently (e.g. similar to
bulk inflation/deflation of pfns right now), we can most probably get
rid of prepare() and cleanup(), simplifying the code further.
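
[If the reporting array lived in VirtIOBalloon and were preallocated at
realize time, much like the pfns[] array used for inflate/deflate
today, the callback interface could shrink to something like this
sketch:

struct page_hinting_cb {
	void (*hint_pages)(struct list_head *list);
	int max_pages;
};
]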

> +};
> +
> +#ifdef CONFIG_PAGE_HINTING
> +void page_hinting_enqueue(struct page *page, int order);
> +void page_hinting_enable(const struct page_hinting_cb *cb);
> +void page_hinting_disable(void);
> +#else
> +static inline void page_hinting_enqueue(struct page *page, int order)
> +{
> +}
> +
> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
> +{
> +}
> +
> +static inline void page_hinting_disable(void)
> +{
> +}
> +#endif
> +#endif /* _LINUX_PAGE_HINTING_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ee8d1f311858..177d858de758 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>  config ARCH_HAS_PTE_SPECIAL
>  	bool
>  
> +# PAGE_HINTING will allow the guest to report the free pages to the
> +# host in regular interval of time.
> +config PAGE_HINTING
> +       bool
> +       def_bool n
> +       depends on X86_64
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index ac5e5ba78874..bec456dfee34 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -41,6 +41,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   interval_tree.o list_lru.o workingset.o \
>  			   debug.o $(mmu-y)
>  
> +
>  # Give 'page_alloc' its own module-parameter namespace
>  page-alloc-y := page_alloc.o
>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)	+= z3fold.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)	+= cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b13d3914176..d12f69e0e402 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/page_hinting.h>
>  
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>   * -- nyc
>   */
>  
> -static inline void __free_one_page(struct page *page,
> +inline void __free_one_page(struct page *page,
>  		unsigned long pfn,
>  		struct zone *zone, unsigned int order,
> -		int migratetype)
> +		int migratetype, bool hint)
>  {
>  	unsigned long combined_pfn;
>  	unsigned long uninitialized_var(buddy_pfn);
> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>  	set_page_order(page, order);
>  
> +	if (hint)
> +		page_hinting_enqueue(page, order);
>  	/*
>  	 * If this is not the largest possible page, check if the buddy
>  	 * of the next-highest order is free. If it is, it's possible
> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		if (unlikely(isolated_pageblocks))
>  			mt = get_pageblock_migratetype(page);
>  
> -		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> +		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>  		trace_mm_page_pcpu_drain(page, 0, mt);
>  	}
>  	spin_unlock(&zone->lock);
> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>  				struct page *page, unsigned long pfn,
>  				unsigned int order,
> -				int migratetype)
> +				int migratetype, bool hint)
>  {
>  	spin_lock(&zone->lock);
>  	if (unlikely(has_isolate_pageblock(zone) ||
>  		is_migrate_isolate(migratetype))) {
>  		migratetype = get_pfnblock_migratetype(page, pfn);
>  	}
> -	__free_one_page(page, pfn, zone, order, migratetype);
> +	__free_one_page(page, pfn, zone, order, migratetype, hint);
>  	spin_unlock(&zone->lock);
>  }
>  
> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>  	migratetype = get_pfnblock_migratetype(page, pfn);
>  	local_irq_save(flags);
>  	__count_vm_events(PGFREE, 1 << order);
> -	free_one_page(page_zone(page), page, pfn, order, migratetype);
> +	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>  	local_irq_restore(flags);
>  }
>  
> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>  	 */
>  	if (migratetype >= MIGRATE_PCPTYPES) {
>  		if (unlikely(is_migrate_isolate(migratetype))) {
> -			free_one_page(zone, page, pfn, 0, migratetype);
> +			free_one_page(zone, page, pfn, 0, migratetype, true);
>  			return;
>  		}
>  		migratetype = MIGRATE_MOVABLE;
> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> new file mode 100644
> index 000000000000..7341c6462de2
> --- /dev/null
> +++ b/mm/page_hinting.c
> @@ -0,0 +1,236 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Page hinting support to enable a VM to report the freed pages back
> + * to the host.
> + *
> + * Copyright Red Hat, Inc. 2019
> + *
> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/page_hinting.h>
> +#include <linux/kvm_host.h>
> +
> +/*
> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
> + * and other required parameters which could help in retrieving the original
> + * PFN value using the bitmap.
> + * @bitmap:		Pointer to the bitmap of free PFN.
> + * @base_pfn:		Starting PFN value for the zone whose bitmap is stored.
> + * @free_pages:		Tracks the number of free pages of granularity
> + *			PAGE_HINTING_MIN_ORDER.
> + * @nbits:		Indicates the total size of the bitmap in bits allocated
> + *			at the time of initialization.
> + */
> +struct hinting_bitmap {
> +	unsigned long *bitmap;
> +	unsigned long base_pfn;
> +	atomic_t free_pages;
> +	unsigned long nbits;
> +} bm_zone[MAX_NR_ZONES];
> +
> +static void init_hinting_wq(struct work_struct *work);
> +extern int __isolate_free_page(struct page *page, unsigned int order);
> +extern void __free_one_page(struct page *page, unsigned long pfn,
> +			    struct zone *zone, unsigned int order,
> +			    int migratetype, bool hint);
> +const struct page_hinting_cb *hcb;
> +struct work_struct hinting_work;
> +
> +static unsigned long find_bitmap_size(struct zone *zone)
> +{
> +	unsigned long nbits = ALIGN(zone->spanned_pages,
> +			    PAGE_HINTING_MIN_ORDER);
> +
> +	nbits = nbits >> PAGE_HINTING_MIN_ORDER;
> +	return nbits;

I think we can simplify this to

return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1;

> +}
> +
> +void page_hinting_enable(const struct page_hinting_cb *callback)
> +{
> +	struct zone *zone;
> +	int idx = 0;
> +	unsigned long bitmap_size = 0;

You should probably protect enabling via a mutex and return -EINVAL or
similar if we already have a callback set (if we ever have different
drivers). But this has very little priority :)
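
Something along these lines is what I have in mind (untested sketch,
showing only the registration part; the bitmap setup below would stay):

static DEFINE_MUTEX(page_hinting_mutex);

int page_hinting_enable(const struct page_hinting_cb *callback)
{
	int ret = -EINVAL;

	mutex_lock(&page_hinting_mutex);
	if (!hcb) {
		hcb = callback;
		ret = 0;
	}
	mutex_unlock(&page_hinting_mutex);
	return ret;
}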

> +
> +	for_each_populated_zone(zone) {
> +		spin_lock(&zone->lock);
> +		bitmap_size = find_bitmap_size(zone);
> +		bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> +		if (!bm_zone[idx].bitmap)
> +			return;
> +		bm_zone[idx].nbits = bitmap_size;
> +		bm_zone[idx].base_pfn = zone->zone_start_pfn;
> +		spin_unlock(&zone->lock);
> +		idx++;
> +	}
> +	hcb = callback;
> +	INIT_WORK(&hinting_work, init_hinting_wq);

There are also possible races when enabling that you will have to take
care of at some point.

> +}
> +EXPORT_SYMBOL_GPL(page_hinting_enable);
> +
> +void page_hinting_disable(void)
> +{
> +	struct zone *zone;
> +	int idx = 0;
> +
> +	cancel_work_sync(&hinting_work);
> +	hcb = NULL;
> +	for_each_populated_zone(zone) {
> +		spin_lock(&zone->lock);
> +		bitmap_free(bm_zone[idx].bitmap);
> +		bm_zone[idx].base_pfn = 0;
> +		bm_zone[idx].nbits = 0;
> +		atomic_set(&bm_zone[idx].free_pages, 0);
> +		spin_unlock(&zone->lock);
> +		idx++;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_disable);
> +
> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
> +{
> +	unsigned long bitnr;
> +
> +	bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
> +			 >> PAGE_HINTING_MIN_ORDER;
> +	return bitnr;
> +}
> +
> +static void release_buddy_pages(struct list_head *pages)

maybe "release_isolated_pages", not sure.

> +{
> +	int mt = 0, zonenum, order;
> +	struct page *page, *next;
> +	struct zone *zone;
> +	unsigned long bitnr;
> +
> +	list_for_each_entry_safe(page, next, pages, lru) {
> +		zonenum = page_zonenum(page);
> +		zone = page_zone(page);
> +		bitnr = pfn_to_bit(page, zonenum);
> +		spin_lock(&zone->lock);
> +		list_del(&page->lru);
> +		order = page_private(page);
> +		set_page_private(page, 0);
> +		mt = get_pageblock_migratetype(page);
> +		__free_one_page(page, page_to_pfn(page), zone,
> +				order, mt, false);
> +		spin_unlock(&zone->lock);
> +	}
> +}
> +
> +static void bm_set_pfn(struct page *page)
> +{
> +	unsigned long bitnr = 0;
> +	int zonenum = page_zonenum(page);
> +	struct zone *zone = page_zone(page);
> +
> +	lockdep_assert_held(&zone->lock);
> +	bitnr = pfn_to_bit(page, zonenum);
> +	if (bm_zone[zonenum].bitmap &&
> +	    bitnr < bm_zone[zonenum].nbits &&
> +	    !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
> +		atomic_inc(&bm_zone[zonenum].free_pages);
> +}
> +
> +static void scan_hinting_bitmap(int zonenum, int free_pages)
> +{
> +	unsigned long set_bit, start = 0;
> +	struct page *page;
> +	struct zone *zone;
> +	int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
> +	LIST_HEAD(isolated_pages);
> +
> +	ret = hcb->prepare();
> +	if (ret < 0)
> +		return;
> +	for (;;) {
> +		ret = 0;
> +		set_bit = find_next_bit(bm_zone[zonenum].bitmap,
> +					bm_zone[zonenum].nbits, start);
> +		if (set_bit >= bm_zone[zonenum].nbits)
> +			break;
> +		page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> +				bm_zone[zonenum].base_pfn);
> +		if (!page)
> +			continue;

You are not clearing the bit / decrementing the counter.
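
E.g. something like this (also bumping start, so the loop keeps making
progress):

		if (!page) {
			clear_bit(set_bit, bm_zone[zonenum].bitmap);
			atomic_dec(&bm_zone[zonenum].free_pages);
			start = set_bit + 1;
			continue;
		}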

> +		zone = page_zone(page);
> +		spin_lock(&zone->lock);
> +
> +		if (PageBuddy(page) && page_private(page) >=
> +		    PAGE_HINTING_MIN_ORDER) {
> +			order = page_private(page);
> +			ret = __isolate_free_page(page, order);
> +		}
> +		clear_bit(set_bit, bm_zone[zonenum].bitmap);
> +		spin_unlock(&zone->lock);
> +		if (ret) {
> +			/*
> +			 * restoring page order to use it while releasing
> +			 * the pages back to the buddy.
> +			 */
> +			set_page_private(page, order);
> +			list_add_tail(&page->lru, &isolated_pages);
> +			isolated_cnt++;
> +			if (isolated_cnt == hcb->max_pages) {
> +				hcb->hint_pages(&isolated_pages);
> +				release_buddy_pages(&isolated_pages);
> +				isolated_cnt = 0;
> +			}
> +		}
> +		start = set_bit + 1;
> +		scanned_pages++;
> +	}
> +	if (isolated_cnt) {
> +		hcb->hint_pages(&isolated_pages);
> +		release_buddy_pages(&isolated_pages);
> +	}
> +	hcb->cleanup();
> +	if (scanned_pages > free_pages)
> +		atomic_sub((scanned_pages - free_pages),
> +			   &bm_zone[zonenum].free_pages);

This looks overly complicated. Can't we somehow simply decrement when
clearing a bit?
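
E.g. via a small helper that keeps the counter in sync with the bitmap
(untested):

static void bm_clear_bit(int zonenum, unsigned long bitnr)
{
	if (test_and_clear_bit(bitnr, bm_zone[zonenum].bitmap))
		atomic_dec(&bm_zone[zonenum].free_pages);
}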

> +}
> +
> +static bool check_hinting_threshold(void)
> +{
> +	int zonenum = 0;
> +
> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +		if (atomic_read(&bm_zone[zonenum].free_pages) >=
> +				hcb->max_pages)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static void init_hinting_wq(struct work_struct *work)
> +{
> +	int zonenum = 0, free_pages = 0;
> +
> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
> +		free_pages = atomic_read(&bm_zone[zonenum].free_pages);
> +		if (free_pages >= hcb->max_pages) {
> +			/* Find a better way to synchronize per zone
> +			 * free_pages.
> +			 */
> +			atomic_sub(free_pages,
> +				   &bm_zone[zonenum].free_pages);

I can't follow yet why we need that information. Wouldn't it be enough
to just track the number of set bits in the bitmap and start hinting
depending on that count? (there are false positives, but do we really care?)

> +			scan_hinting_bitmap(zonenum, free_pages);
> +		}
> +	}
> +}
> +
> +void page_hinting_enqueue(struct page *page, int order)
> +{
> +	if (hcb && order >= PAGE_HINTING_MIN_ORDER)
> +		bm_set_pfn(page);
> +	else
> +		return;
> +
> +	if (check_hinting_threshold()) {
> +		int cpu = smp_processor_id();
> +
> +		queue_work_on(cpu, system_wq, &hinting_work);
> +	}
> +}
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-06-03 17:03 ` [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
@ 2019-06-03 22:38   ` Alexander Duyck
  2019-06-04  7:12     ` David Hildenbrand
  2019-06-04 11:31     ` Nitesh Narayan Lal
  2019-06-04 16:33   ` Alexander Duyck
  1 sibling, 2 replies; 33+ messages in thread
From: Alexander Duyck @ 2019-06-03 22:38 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> Enables the kernel to negotiate the VIRTIO_BALLOON_F_HINTING feature with
> the host. If it is available and page_hinting_flag is set to true,
> page_hinting is enabled and its callbacks are configured along with the
> max_pages count, which indicates the maximum number of pages that can be
> isolated and hinted at a time. Currently, only free pages of order >=
> (MAX_ORDER - 2) are reported. To prevent any false OOMs, the max_pages
> count is set to 16.
>
> By default, the page_hinting feature is enabled and gets loaded as soon
> as the virtio-balloon driver is loaded. However, it can be disabled by
> writing to the page_hinting_flag, which is a virtio-balloon parameter.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 112 +++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_balloon.h |  14 ++++
>  2 files changed, 125 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f19061b585a4..40f09ea31643 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -31,6 +31,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/page_hinting.h>
>
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -48,6 +49,7 @@
>  /* The size of a free page block in bytes */
>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>         (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES  16
>
>  #ifdef CONFIG_BALLOON_COMPACTION
>  static struct vfsmount *balloon_mnt;
> @@ -58,6 +60,7 @@ enum virtio_balloon_vq {
>         VIRTIO_BALLOON_VQ_DEFLATE,
>         VIRTIO_BALLOON_VQ_STATS,
>         VIRTIO_BALLOON_VQ_FREE_PAGE,
> +       VIRTIO_BALLOON_VQ_HINTING,
>         VIRTIO_BALLOON_VQ_MAX
>  };
>
> @@ -67,7 +70,8 @@ enum virtio_balloon_config_read {
>
>  struct virtio_balloon {
>         struct virtio_device *vdev;
> -       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> +                        *hinting_vq;
>
>         /* Balloon's own wq for cpu-intensive work items */
>         struct workqueue_struct *balloon_wq;
> @@ -125,6 +129,9 @@ struct virtio_balloon {
>
>         /* To register a shrinker to shrink memory upon memory pressure */
>         struct shrinker shrinker;
> +
> +       /* object pointing at the array of isolated pages ready for hinting */
> +       struct hinting_data *hinting_arr;

Just make this an array of size VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES.
It will save a bunch of complexity later.
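
I.e. something like this in struct virtio_balloon (other members
omitted):

	/* isolated pages ready for hinting, no runtime allocation needed */
	struct hinting_data hinting_arr[VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES];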

>  };
>
>  static struct virtio_device_id id_table[] = {
> @@ -132,6 +139,85 @@ static struct virtio_device_id id_table[] = {
>         { 0 },
>  };
>
> +#ifdef CONFIG_PAGE_HINTING

Instead of having CONFIG_PAGE_HINTING enable this, maybe we should
have virtio-balloon enable CONFIG_PAGE_HINTING.

> +struct virtio_balloon *hvb;
> +bool page_hinting_flag = true;
> +module_param(page_hinting_flag, bool, 0444);
> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
> +
> +static bool virtqueue_kick_sync(struct virtqueue *vq)
> +{
> +       u32 len;
> +
> +       if (likely(virtqueue_kick(vq))) {
> +               while (!virtqueue_get_buf(vq, &len) &&
> +                      !virtqueue_is_broken(vq))
> +                       cpu_relax();
> +               return true;

Is this a synchronous setup? It seems kind of wasteful to have a
thread busy waiting here like this. It might make more sense to just
make this work like the other balloon queues and have a wait event
with a wake up in the interrupt handler for the queue.
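
Roughly like this (untested; assumes you register the callback in
init_vqs() instead of the NULL you have now, and reuse the vb->acked
wait queue that tell_host()/balloon_ack() already use):

static void hinting_ack(struct virtqueue *vq)
{
	struct virtio_balloon *vb = vq->vdev->priv;

	wake_up(&vb->acked);
}

...and the reporting side then does a plain virtqueue_kick() followed by

	wait_event(vb->acked, virtqueue_get_buf(vq, &len));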

> +       }
> +       return false;
> +}
> +
> +static void page_hinting_report(int entries)
> +{
> +       struct scatterlist sg;
> +       struct virtqueue *vq = hvb->hinting_vq;
> +       int err = 0;
> +       struct hinting_data *hint_req;
> +       u64 gpaddr;
> +
> +       hint_req = kmalloc(sizeof(*hint_req), GFP_KERNEL);
> +       if (!hint_req)
> +               return;

Why do we need another allocation here? Couldn't you just allocate
hint_req on the stack and then use that? I think we might be doing too
much here. All this really needs to look like is something along the
lines of tell_host() minus the wait_event.
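
For illustration, keeping your synchronous kick for the moment, the
whole thing could shrink to something like (untested; the on-stack
request is only safe because we don't return until the host has
consumed the buffer):

static void page_hinting_report(int entries)
{
	struct virtqueue *vq = hvb->hinting_vq;
	struct hinting_data hint_req;
	struct scatterlist sg;

	hint_req.phys_addr = cpu_to_virtio64(hvb->vdev,
					     virt_to_phys(hvb->hinting_arr));
	hint_req.size = cpu_to_virtio32(hvb->vdev, entries);
	sg_init_one(&sg, &hint_req, sizeof(hint_req));
	if (!virtqueue_add_outbuf(vq, &sg, 1, &hint_req, GFP_KERNEL))
		virtqueue_kick_sync(vq);
}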

> +       gpaddr = virt_to_phys(hvb->hinting_arr);
> +       hint_req->phys_addr = cpu_to_virtio64(hvb->vdev, gpaddr);
> +       hint_req->size = cpu_to_virtio32(hvb->vdev, entries);
> +       sg_init_one(&sg, hint_req, sizeof(*hint_req));
> +       err = virtqueue_add_outbuf(vq, &sg, 1, hint_req, GFP_KERNEL);
> +       if (!err)
> +               virtqueue_kick_sync(hvb->hinting_vq);
> +
> +       kfree(hint_req);
> +}
> +
> +int page_hinting_prepare(void)
> +{
> +       hvb->hinting_arr = kmalloc_array(VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
> +                                        sizeof(*hvb->hinting_arr), GFP_KERNEL);
> +       if (!hvb->hinting_arr)
> +               return -ENOMEM;
> +       return 0;
> +}
> +

Why make the hinting_arr a dynamic allocation? You should probably
just make it a static array within the virtio_balloon structure. Then
you don't have the risk of an allocation failing and messing up the
hints.

> +void hint_pages(struct list_head *pages)
> +{
> +       struct page *page, *next;
> +       unsigned long pfn;
> +       int idx = 0, order;
> +
> +       list_for_each_entry_safe(page, next, pages, lru) {
> +               pfn = page_to_pfn(page);
> +               order = page_private(page);
> +               hvb->hinting_arr[idx].phys_addr = pfn << PAGE_SHIFT;
> +               hvb->hinting_arr[idx].size = (1 << order) * PAGE_SIZE;
> +               idx++;
> +       }
> +       page_hinting_report(idx);
> +}
> +

Getting back to my suggestion from earlier today. It might make sense
to not bother with the PAGE_SHIFT or PAGE_SIZE multiplication if you
just record everything in VIRTIO_BALLOON_PAGES instead of using the
actual address and size.
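
I.e. something along the lines of (hypothetical field names):

		hvb->hinting_arr[idx].pfn = page_to_balloon_pfn(page);
		hvb->hinting_arr[idx].npages =
			(1 << order) << (PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT);

so everything is recorded in 4K balloon-page units and the host does
the rest of the math.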

> +void page_hinting_cleanup(void)
> +{
> +       kfree(hvb->hinting_arr);
> +}
> +

Same comment here. Make this array a part of virtio_balloon and you
don't have to free it.

> +static const struct page_hinting_cb hcb = {
> +       .prepare = page_hinting_prepare,
> +       .hint_pages = hint_pages,
> +       .cleanup = page_hinting_cleanup,
> +       .max_pages = VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
> +};

With the above changes prepare and cleanup can be dropped.

> +#endif
> +
>  static u32 page_to_balloon_pfn(struct page *page)
>  {
>         unsigned long pfn = page_to_pfn(page);
> @@ -488,6 +574,7 @@ static int init_vqs(struct virtio_balloon *vb)
>         names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>         names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>         names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> +       names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>
>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>                 names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -499,11 +586,18 @@ static int init_vqs(struct virtio_balloon *vb)
>                 callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>         }
>
> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> +               names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> +               callbacks[VIRTIO_BALLOON_VQ_HINTING] = NULL;
> +       }
>         err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>                                          vqs, callbacks, names, NULL, NULL);
>         if (err)
>                 return err;
>
> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> +               vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> +
>         vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>         vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> @@ -942,6 +1036,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>                 if (err)
>                         goto out_del_balloon_wq;
>         }
> +
> +#ifdef CONFIG_PAGE_HINTING
> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
> +           page_hinting_flag) {
> +               hvb = vb;
> +               page_hinting_enable(&hcb);
> +       }
> +#endif
>         virtio_device_ready(vdev);
>
>         if (towards_target(vb))
> @@ -989,6 +1091,12 @@ static void virtballoon_remove(struct virtio_device *vdev)
>                 destroy_workqueue(vb->balloon_wq);
>         }
>
> +#ifdef CONFIG_PAGE_HINTING
> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> +               hvb = NULL;
> +               page_hinting_disable();
> +       }
> +#endif
>         remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
>         if (vb->vb_dev_info.inode)
> @@ -1043,8 +1151,10 @@ static unsigned int features[] = {
>         VIRTIO_BALLOON_F_MUST_TELL_HOST,
>         VIRTIO_BALLOON_F_STATS_VQ,
>         VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +       VIRTIO_BALLOON_F_HINTING,
>         VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>         VIRTIO_BALLOON_F_PAGE_POISON,
> +       VIRTIO_BALLOON_F_HINTING,
>  };
>
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index a1966cd7b677..25e4f817c660 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -29,6 +29,7 @@
>  #include <linux/virtio_types.h>
>  #include <linux/virtio_ids.h>
>  #include <linux/virtio_config.h>
> +#include <linux/page_hinting.h>
>
>  /* The feature bitmap for virtio balloon */
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST        0 /* Tell before reclaiming pages */
> @@ -36,6 +37,7 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT        3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
>
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -108,4 +110,16 @@ struct virtio_balloon_stat {
>         __virtio64 val;
>  } __attribute__((packed));
>
> +#ifdef CONFIG_PAGE_HINTING
> +/*
> + * struct hinting_data- holds the information associated with hinting.
> + * @phys_add:  physical address associated with a page or the array holding
> + *             the array of isolated pages.
> + * @size:      total size associated with the phys_addr.
> + */
> +struct hinting_data {
> +       __virtio64 phys_addr;
> +       __virtio32 size;
> +};

So in order to avoid errors this should either have
"__attribute__((packed))" added or it should be changed to a pair of
u32 or u64 values so that it will always be the same size regardless
of what platform it is built on.
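
I.e. either

struct hinting_data {
	__virtio64 phys_addr;
	__virtio32 size;
} __attribute__((packed));

or two __virtio64 members, so the layout doesn't depend on the
compiler's padding.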

> +#endif
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> --
> 2.21.0
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-06-03 22:38   ` Alexander Duyck
@ 2019-06-04  7:12     ` David Hildenbrand
  2019-06-04 11:50       ` Nitesh Narayan Lal
  2019-06-04 11:31     ` Nitesh Narayan Lal
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2019-06-04  7:12 UTC (permalink / raw)
  To: Alexander Duyck, Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On 04.06.19 00:38, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> Enables the kernel to negotiate the VIRTIO_BALLOON_F_HINTING feature with
>> the host. If it is available and page_hinting_flag is set to true,
>> page_hinting is enabled and its callbacks are configured along with the
>> max_pages count, which indicates the maximum number of pages that can be
>> isolated and hinted at a time. Currently, only free pages of order >=
>> (MAX_ORDER - 2) are reported. To prevent any false OOMs, the max_pages
>> count is set to 16.
>>
>> By default, the page_hinting feature is enabled and gets loaded as soon
>> as the virtio-balloon driver is loaded. However, it can be disabled by
>> writing to the page_hinting_flag, which is a virtio-balloon parameter.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/virtio_balloon.c     | 112 +++++++++++++++++++++++++++-
>>  include/uapi/linux/virtio_balloon.h |  14 ++++
>>  2 files changed, 125 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index f19061b585a4..40f09ea31643 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -31,6 +31,7 @@
>>  #include <linux/mm.h>
>>  #include <linux/mount.h>
>>  #include <linux/magic.h>
>> +#include <linux/page_hinting.h>
>>
>>  /*
>>   * Balloon device works in 4K page units.  So each page is pointed to by
>> @@ -48,6 +49,7 @@
>>  /* The size of a free page block in bytes */
>>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>>         (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES  16
>>
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>  static struct vfsmount *balloon_mnt;
>> @@ -58,6 +60,7 @@ enum virtio_balloon_vq {
>>         VIRTIO_BALLOON_VQ_DEFLATE,
>>         VIRTIO_BALLOON_VQ_STATS,
>>         VIRTIO_BALLOON_VQ_FREE_PAGE,
>> +       VIRTIO_BALLOON_VQ_HINTING,
>>         VIRTIO_BALLOON_VQ_MAX
>>  };
>>
>> @@ -67,7 +70,8 @@ enum virtio_balloon_config_read {
>>
>>  struct virtio_balloon {
>>         struct virtio_device *vdev;
>> -       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>> +       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
>> +                        *hinting_vq;
>>
>>         /* Balloon's own wq for cpu-intensive work items */
>>         struct workqueue_struct *balloon_wq;
>> @@ -125,6 +129,9 @@ struct virtio_balloon {
>>
>>         /* To register a shrinker to shrink memory upon memory pressure */
>>         struct shrinker shrinker;
>> +
>> +       /* object pointing at the array of isolated pages ready for hinting */
>> +       struct hinting_data *hinting_arr;
> 
> Just make this an array of size VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES.
> It will save a bunch of complexity later.

+1

[...]
> 
>> +struct virtio_balloon *hvb;
>> +bool page_hinting_flag = true;
>> +module_param(page_hinting_flag, bool, 0444);
>> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
>> +
>> +static bool virtqueue_kick_sync(struct virtqueue *vq)
>> +{
>> +       u32 len;
>> +
>> +       if (likely(virtqueue_kick(vq))) {
>> +               while (!virtqueue_get_buf(vq, &len) &&
>> +                      !virtqueue_is_broken(vq))
>> +                       cpu_relax();
>> +               return true;
> 
> Is this a synchronous setup? It seems kind of wasteful to have a
> thread busy waiting here like this. It might make more sense to just
> make this work like the other balloon queues and have a wait event
> with a wake up in the interrupt handler for the queue.

+1

[...]

> 
>> +       gpaddr = virt_to_phys(hvb->hinting_arr);
>> +       hint_req->phys_addr = cpu_to_virtio64(hvb->vdev, gpaddr);
>> +       hint_req->size = cpu_to_virtio32(hvb->vdev, entries);
>> +       sg_init_one(&sg, hint_req, sizeof(*hint_req));
>> +       err = virtqueue_add_outbuf(vq, &sg, 1, hint_req, GFP_KERNEL);
>> +       if (!err)
>> +               virtqueue_kick_sync(hvb->hinting_vq);
>> +
>> +       kfree(hint_req);
>> +}
>> +
>> +int page_hinting_prepare(void)
>> +{
>> +       hvb->hinting_arr = kmalloc_array(VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
>> +                                        sizeof(*hvb->hinting_arr), GFP_KERNEL);
>> +       if (!hvb->hinting_arr)
>> +               return -ENOMEM;
>> +       return 0;
>> +}
>> +
> 
> Why make the hinting_arr a dynamic allocation? You should probably
> just make it a static array within the virtio_balloon structure. Then
> you don't have the risk of an allocation failing and messing up the
> hints.

+1

> 
>> +void hint_pages(struct list_head *pages)
>> +{
>> +       struct page *page, *next;
>> +       unsigned long pfn;
>> +       int idx = 0, order;
>> +
>> +       list_for_each_entry_safe(page, next, pages, lru) {
>> +               pfn = page_to_pfn(page);
>> +               order = page_private(page);
>> +               hvb->hinting_arr[idx].phys_addr = pfn << PAGE_SHIFT;
>> +               hvb->hinting_arr[idx].size = (1 << order) * PAGE_SIZE;
>> +               idx++;
>> +       }
>> +       page_hinting_report(idx);
>> +}
>> +
> 
> Getting back to my suggestion from earlier today. It might make sense
> to not bother with the PAGE_SHIFT or PAGE_SIZE multiplication if you
> just record everything in VIRTIO_BALLOON_PAGES instead of using the
> actual address and size.

I think I prefer "addr + size".

> 
> Same comment here. Make this array a part of virtio_balloon and you
> don't have to free it.
> 
>> +static const struct page_hinting_cb hcb = {
>> +       .prepare = page_hinting_prepare,
>> +       .hint_pages = hint_pages,
>> +       .cleanup = page_hinting_cleanup,
>> +       .max_pages = VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
>> +};
> 
> With the above changes prepare and cleanup can be dropped.

+1

> 
>> +#endif
>> +
>>  static u32 page_to_balloon_pfn(struct page *page)
>>  {
>>         unsigned long pfn = page_to_pfn(page);
>> @@ -488,6 +574,7 @@ static int init_vqs(struct virtio_balloon *vb)
>>         names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>>         names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>>         names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>> +       names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>>
>>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>                 names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>> @@ -499,11 +586,18 @@ static int init_vqs(struct virtio_balloon *vb)
>>                 callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>         }
>>
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
>> +               names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
>> +               callbacks[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>> +       }
>>         err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>>                                          vqs, callbacks, names, NULL, NULL);
>>         if (err)
>>                 return err;
>>
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
>> +               vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
>> +
>>         vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>         vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>> @@ -942,6 +1036,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>                 if (err)
>>                         goto out_del_balloon_wq;
>>         }
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
>> +           page_hinting_flag) {
>> +               hvb = vb;
>> +               page_hinting_enable(&hcb);
>> +       }
>> +#endif
>>         virtio_device_ready(vdev);
>>
>>         if (towards_target(vb))
>> @@ -989,6 +1091,12 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>                 destroy_workqueue(vb->balloon_wq);
>>         }
>>
>> +#ifdef CONFIG_PAGE_HINTING
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {

Nitesh, you should only disable if you actually enabled it
(page_hinting_flag).
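
I.e. mirror the probe-side check, something like:

	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
	    page_hinting_flag) {
		hvb = NULL;
		page_hinting_disable();
	}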



-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-06-03 22:38   ` Alexander Duyck
  2019-06-04  7:12     ` David Hildenbrand
@ 2019-06-04 11:31     ` Nitesh Narayan Lal
  1 sibling, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 11:31 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli



On 6/3/19 6:38 PM, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> Enables the kernel to negotiate the VIRTIO_BALLOON_F_HINTING feature with
>> the host. If it is available and page_hinting_flag is set to true,
>> page_hinting is enabled and its callbacks are configured along with the
>> max_pages count, which indicates the maximum number of pages that can be
>> isolated and hinted at a time. Currently, only free pages of order >=
>> (MAX_ORDER - 2) are reported. To prevent any false OOMs, the max_pages
>> count is set to 16.
>>
>> By default, the page_hinting feature is enabled and gets loaded as soon
>> as the virtio-balloon driver is loaded. However, it can be disabled by
>> writing to the page_hinting_flag, which is a virtio-balloon parameter.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/virtio_balloon.c     | 112 +++++++++++++++++++++++++++-
>>  include/uapi/linux/virtio_balloon.h |  14 ++++
>>  2 files changed, 125 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index f19061b585a4..40f09ea31643 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -31,6 +31,7 @@
>>  #include <linux/mm.h>
>>  #include <linux/mount.h>
>>  #include <linux/magic.h>
>> +#include <linux/page_hinting.h>
>>
>>  /*
>>   * Balloon device works in 4K page units.  So each page is pointed to by
>> @@ -48,6 +49,7 @@
>>  /* The size of a free page block in bytes */
>>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>>         (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES  16
>>
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>  static struct vfsmount *balloon_mnt;
>> @@ -58,6 +60,7 @@ enum virtio_balloon_vq {
>>         VIRTIO_BALLOON_VQ_DEFLATE,
>>         VIRTIO_BALLOON_VQ_STATS,
>>         VIRTIO_BALLOON_VQ_FREE_PAGE,
>> +       VIRTIO_BALLOON_VQ_HINTING,
>>         VIRTIO_BALLOON_VQ_MAX
>>  };
>>
>> @@ -67,7 +70,8 @@ enum virtio_balloon_config_read {
>>
>>  struct virtio_balloon {
>>         struct virtio_device *vdev;
>> -       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>> +       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
>> +                        *hinting_vq;
>>
>>         /* Balloon's own wq for cpu-intensive work items */
>>         struct workqueue_struct *balloon_wq;
>> @@ -125,6 +129,9 @@ struct virtio_balloon {
>>
>>         /* To register a shrinker to shrink memory upon memory pressure */
>>         struct shrinker shrinker;
>> +
>> +       /* object pointing at the array of isolated pages ready for hinting */
>> +       struct hinting_data *hinting_arr;
> Just make this an array of size VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES.
> It will save a bunch of complexity later.
Makes sense.
>>  };
>>
>>  static struct virtio_device_id id_table[] = {
>> @@ -132,6 +139,85 @@ static struct virtio_device_id id_table[] = {
>>         { 0 },
>>  };
>>
>> +#ifdef CONFIG_PAGE_HINTING
> Instead of having CONFIG_PAGE_HINTING enable this, maybe we should
> have virtio-balloon enable CONFIG_PAGE_HINTING.
>
>> +struct virtio_balloon *hvb;
>> +bool page_hinting_flag = true;
>> +module_param(page_hinting_flag, bool, 0444);
>> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
>> +
>> +static bool virtqueue_kick_sync(struct virtqueue *vq)
>> +{
>> +       u32 len;
>> +
>> +       if (likely(virtqueue_kick(vq))) {
>> +               while (!virtqueue_get_buf(vq, &len) &&
>> +                      !virtqueue_is_broken(vq))
>> +                       cpu_relax();
>> +               return true;
> Is this a synchronous setup? 
Yes.
> It seems kind of wasteful to have a
> thread busy waiting here like this. It might make more sense to just
> make this work like the other balloon queues and have a wait event
> with a wake up in the interrupt handler for the queue.

I can do that. I may have to add a flag or something else to
page_hinting.c as well, to ensure that no other bitmap scanning is
initiated until the previous request completes.
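
Something like this is what I have in mind (untested sketch):

static atomic_t scan_in_progress = ATOMIC_INIT(0);

static void init_hinting_wq(struct work_struct *work)
{
	if (atomic_cmpxchg(&scan_in_progress, 0, 1))
		return;
	/* scan the per-zone bitmaps as before */
	atomic_set(&scan_in_progress, 0);
}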

>
>> +       }
>> +       return false;
>> +}
>> +
>> +static void page_hinting_report(int entries)
>> +{
>> +       struct scatterlist sg;
>> +       struct virtqueue *vq = hvb->hinting_vq;
>> +       int err = 0;
>> +       struct hinting_data *hint_req;
>> +       u64 gpaddr;
>> +
>> +       hint_req = kmalloc(sizeof(*hint_req), GFP_KERNEL);
>> +       if (!hint_req)
>> +               return;
> Why do we need another allocation here? Couldn't you just allocate
> hint_req on the stack and then use that? I think we might be doing too
> much here. All this really needs to look like is something along the
> lines of tell_host() minus the wait_event.
>
>> +       gpaddr = virt_to_phys(hvb->hinting_arr);
>> +       hint_req->phys_addr = cpu_to_virtio64(hvb->vdev, gpaddr);
>> +       hint_req->size = cpu_to_virtio32(hvb->vdev, entries);
>> +       sg_init_one(&sg, hint_req, sizeof(*hint_req));
>> +       err = virtqueue_add_outbuf(vq, &sg, 1, hint_req, GFP_KERNEL);
>> +       if (!err)
>> +               virtqueue_kick_sync(hvb->hinting_vq);
>> +
>> +       kfree(hint_req);
>> +}
>> +
>> +int page_hinting_prepare(void)
>> +{
>> +       hvb->hinting_arr = kmalloc_array(VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
>> +                                        sizeof(*hvb->hinting_arr), GFP_KERNEL);
>> +       if (!hvb->hinting_arr)
>> +               return -ENOMEM;
>> +       return 0;
>> +}
>> +
> Why make the hinting_arr a dynamic allocation? You should probably
> just make it a static array within the virtio_balloon structure. Then
> you don't have the risk of an allocation failing and messing up the
> hints.
I agree.
>> +void hint_pages(struct list_head *pages)
>> +{
>> +       struct page *page, *next;
>> +       unsigned long pfn;
>> +       int idx = 0, order;
>> +
>> +       list_for_each_entry_safe(page, next, pages, lru) {
>> +               pfn = page_to_pfn(page);
>> +               order = page_private(page);
>> +               hvb->hinting_arr[idx].phys_addr = pfn << PAGE_SHIFT;
>> +               hvb->hinting_arr[idx].size = (1 << order) * PAGE_SIZE;
>> +               idx++;
>> +       }
>> +       page_hinting_report(idx);
>> +}
>> +
> Getting back to my suggestion from earlier today. It might make sense
> to not bother with the PAGE_SHIFT or PAGE_SIZE multiplication if you
> just record everything in VIRTIO_BALLOON_PAGES instead of using the
> actual address and size.
>
>> +void page_hinting_cleanup(void)
>> +{
>> +       kfree(hvb->hinting_arr);
>> +}
>> +
> Same comment here. Make this array a part of virtio_balloon and you
> don't have to free it.
+1
>
>> +static const struct page_hinting_cb hcb = {
>> +       .prepare = page_hinting_prepare,
>> +       .hint_pages = hint_pages,
>> +       .cleanup = page_hinting_cleanup,
>> +       .max_pages = VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
>> +};
> With the above changes prepare and cleanup can be dropped.
>
>> +#endif
>> +
>>  static u32 page_to_balloon_pfn(struct page *page)
>>  {
>>         unsigned long pfn = page_to_pfn(page);
>> @@ -488,6 +574,7 @@ static int init_vqs(struct virtio_balloon *vb)
>>         names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>>         names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>>         names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>> +       names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>>
>>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>                 names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>> @@ -499,11 +586,18 @@ static int init_vqs(struct virtio_balloon *vb)
>>                 callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>         }
>>
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
>> +               names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
>> +               callbacks[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>> +       }
>>         err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>>                                          vqs, callbacks, names, NULL, NULL);
>>         if (err)
>>                 return err;
>>
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
>> +               vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
>> +
>>         vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>         vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>> @@ -942,6 +1036,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>                 if (err)
>>                         goto out_del_balloon_wq;
>>         }
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
>> +           page_hinting_flag) {
>> +               hvb = vb;
>> +               page_hinting_enable(&hcb);
>> +       }
>> +#endif
>>         virtio_device_ready(vdev);
>>
>>         if (towards_target(vb))
>> @@ -989,6 +1091,12 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>                 destroy_workqueue(vb->balloon_wq);
>>         }
>>
>> +#ifdef CONFIG_PAGE_HINTING
>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
>> +               hvb = NULL;
>> +               page_hinting_disable();
>> +       }
>> +#endif
>>         remove_common(vb);
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>         if (vb->vb_dev_info.inode)
>> @@ -1043,8 +1151,10 @@ static unsigned int features[] = {
>>         VIRTIO_BALLOON_F_MUST_TELL_HOST,
>>         VIRTIO_BALLOON_F_STATS_VQ,
>>         VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>> +       VIRTIO_BALLOON_F_HINTING,
>>         VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>>         VIRTIO_BALLOON_F_PAGE_POISON,
>> +       VIRTIO_BALLOON_F_HINTING,
>>  };
>>
>>  static struct virtio_driver virtio_balloon_driver = {
>> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
>> index a1966cd7b677..25e4f817c660 100644
>> --- a/include/uapi/linux/virtio_balloon.h
>> +++ b/include/uapi/linux/virtio_balloon.h
>> @@ -29,6 +29,7 @@
>>  #include <linux/virtio_types.h>
>>  #include <linux/virtio_ids.h>
>>  #include <linux/virtio_config.h>
>> +#include <linux/page_hinting.h>
>>
>>  /* The feature bitmap for virtio balloon */
>>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST        0 /* Tell before reclaiming pages */
>> @@ -36,6 +37,7 @@
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT        3 /* VQ to report free pages */
>>  #define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
>> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
>>
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>> @@ -108,4 +110,16 @@ struct virtio_balloon_stat {
>>         __virtio64 val;
>>  } __attribute__((packed));
>>
>> +#ifdef CONFIG_PAGE_HINTING
>> +/*
>> + * struct hinting_data- holds the information associated with hinting.
>> + * @phys_add:  physical address associated with a page or the array holding
>> + *             the array of isolated pages.
>> + * @size:      total size associated with the phys_addr.
>> + */
>> +struct hinting_data {
>> +       __virtio64 phys_addr;
>> +       __virtio32 size;
>> +};
> So in order to avoid errors this should either have
> "__attribute__((packed))" added or it should be changed to a pair of
> u32 or u64 values so that it will always be the same size regardless
> of what platform it is built on.
I will take a look at this.
Thanks.
>> +#endif
>>  #endif /* _LINUX_VIRTIO_BALLOON_H */
>> --
>> 2.21.0
>>
-- 
Regards
Nitesh



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-06-04  7:12     ` David Hildenbrand
@ 2019-06-04 11:50       ` Nitesh Narayan Lal
  0 siblings, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 11:50 UTC (permalink / raw)
  To: David Hildenbrand, Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, Michael S. Tsirkin, dodgen,
	Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli



On 6/4/19 3:12 AM, David Hildenbrand wrote:
> On 04.06.19 00:38, Alexander Duyck wrote:
>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>> Enables the kernel to negotiate the VIRTIO_BALLOON_F_HINTING feature with
>>> the host. If it is available and page_hinting_flag is set to true,
>>> page_hinting is enabled and its callbacks are configured along with the
>>> max_pages count, which indicates the maximum number of pages that can be
>>> isolated and hinted at a time. Currently, only free pages of order >=
>>> (MAX_ORDER - 2) are reported. To prevent any false OOMs, the max_pages
>>> count is set to 16.
>>>
>>> By default, the page_hinting feature is enabled and gets loaded as soon
>>> as the virtio-balloon driver is loaded. However, it can be disabled by
>>> writing to the page_hinting_flag, which is a virtio-balloon parameter.
>>>
>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> ---
>>>  drivers/virtio/virtio_balloon.c     | 112 +++++++++++++++++++++++++++-
>>>  include/uapi/linux/virtio_balloon.h |  14 ++++
>>>  2 files changed, 125 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>> index f19061b585a4..40f09ea31643 100644
>>> --- a/drivers/virtio/virtio_balloon.c
>>> +++ b/drivers/virtio/virtio_balloon.c
>>> @@ -31,6 +31,7 @@
>>>  #include <linux/mm.h>
>>>  #include <linux/mount.h>
>>>  #include <linux/magic.h>
>>> +#include <linux/page_hinting.h>
>>>
>>>  /*
>>>   * Balloon device works in 4K page units.  So each page is pointed to by
>>> @@ -48,6 +49,7 @@
>>>  /* The size of a free page block in bytes */
>>>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>>>         (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>>> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES  16
>>>
>>>  #ifdef CONFIG_BALLOON_COMPACTION
>>>  static struct vfsmount *balloon_mnt;
>>> @@ -58,6 +60,7 @@ enum virtio_balloon_vq {
>>>         VIRTIO_BALLOON_VQ_DEFLATE,
>>>         VIRTIO_BALLOON_VQ_STATS,
>>>         VIRTIO_BALLOON_VQ_FREE_PAGE,
>>> +       VIRTIO_BALLOON_VQ_HINTING,
>>>         VIRTIO_BALLOON_VQ_MAX
>>>  };
>>>
>>> @@ -67,7 +70,8 @@ enum virtio_balloon_config_read {
>>>
>>>  struct virtio_balloon {
>>>         struct virtio_device *vdev;
>>> -       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>> +       struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
>>> +                        *hinting_vq;
>>>
>>>         /* Balloon's own wq for cpu-intensive work items */
>>>         struct workqueue_struct *balloon_wq;
>>> @@ -125,6 +129,9 @@ struct virtio_balloon {
>>>
>>>         /* To register a shrinker to shrink memory upon memory pressure */
>>>         struct shrinker shrinker;
>>> +
>>> +       /* object pointing at the array of isolated pages ready for hinting */
>>> +       struct hinting_data *hinting_arr;
>> Just make this an array of size VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES.
>> It will save a bunch of complexity later.
> +1
>
> [...]
>>> +struct virtio_balloon *hvb;
>>> +bool page_hinting_flag = true;
>>> +module_param(page_hinting_flag, bool, 0444);
>>> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
>>> +
>>> +static bool virtqueue_kick_sync(struct virtqueue *vq)
>>> +{
>>> +       u32 len;
>>> +
>>> +       if (likely(virtqueue_kick(vq))) {
>>> +               while (!virtqueue_get_buf(vq, &len) &&
>>> +                      !virtqueue_is_broken(vq))
>>> +                       cpu_relax();
>>> +               return true;
>> Is this a synchronous setup? It seems kind of wasteful to have a
>> thread busy waiting here like this. It might make more sense to just
>> make this work like the other balloon queues and have a wait event
>> with a wake up in the interrupt handler for the queue.
> +1
>
> [...]
>
>>> +       gpaddr = virt_to_phys(hvb->hinting_arr);
>>> +       hint_req->phys_addr = cpu_to_virtio64(hvb->vdev, gpaddr);
>>> +       hint_req->size = cpu_to_virtio32(hvb->vdev, entries);
>>> +       sg_init_one(&sg, hint_req, sizeof(*hint_req));
>>> +       err = virtqueue_add_outbuf(vq, &sg, 1, hint_req, GFP_KERNEL);
>>> +       if (!err)
>>> +               virtqueue_kick_sync(hvb->hinting_vq);
>>> +
>>> +       kfree(hint_req);
>>> +}
>>> +
>>> +int page_hinting_prepare(void)
>>> +{
>>> +       hvb->hinting_arr = kmalloc_array(VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
>>> +                                        sizeof(*hvb->hinting_arr), GFP_KERNEL);
>>> +       if (!hvb->hinting_arr)
>>> +               return -ENOMEM;
>>> +       return 0;
>>> +}
>>> +
>> Why make the hinting_arr a dynamic allocation? You should probably
>> just make it a static array within the virtio_balloon structure. Then
>> you don't have the risk of an allocation failing and messing up the
>> hints.
> +1
>
>>> +void hint_pages(struct list_head *pages)
>>> +{
>>> +       struct page *page, *next;
>>> +       unsigned long pfn;
>>> +       int idx = 0, order;
>>> +
>>> +       list_for_each_entry_safe(page, next, pages, lru) {
>>> +               pfn = page_to_pfn(page);
>>> +               order = page_private(page);
>>> +               hvb->hinting_arr[idx].phys_addr = pfn << PAGE_SHIFT;
>>> +               hvb->hinting_arr[idx].size = (1 << order) * PAGE_SIZE;
>>> +               idx++;
>>> +       }
>>> +       page_hinting_report(idx);
>>> +}
>>> +
>> Getting back to my suggestion from earlier today. It might make sense
>> to not bother with the PAGE_SHIFT or PAGE_SIZE multiplication if you
>> just record everything in VIRTIO_BALLOON_PAGES instead of using the
>> actual address and size.
> I think I prefer "addr + size".
>
>> Same comment here. Make this array a part of virtio_balloon and you
>> don't have to free it.
>>
>>> +static const struct page_hinting_cb hcb = {
>>> +       .prepare = page_hinting_prepare,
>>> +       .hint_pages = hint_pages,
>>> +       .cleanup = page_hinting_cleanup,
>>> +       .max_pages = VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES,
>>> +};
>> With the above changes prepare and cleanup can be dropped.
> +1
>
>>> +#endif
>>> +
>>>  static u32 page_to_balloon_pfn(struct page *page)
>>>  {
>>>         unsigned long pfn = page_to_pfn(page);
>>> @@ -488,6 +574,7 @@ static int init_vqs(struct virtio_balloon *vb)
>>>         names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>>>         names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>>>         names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>> +       names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>>>
>>>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>>                 names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>>> @@ -499,11 +586,18 @@ static int init_vqs(struct virtio_balloon *vb)
>>>                 callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>>         }
>>>
>>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
>>> +               names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
>>> +               callbacks[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>>> +       }
>>>         err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>>>                                          vqs, callbacks, names, NULL, NULL);
>>>         if (err)
>>>                 return err;
>>>
>>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
>>> +               vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
>>> +
>>>         vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>>         vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>>         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>> @@ -942,6 +1036,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>>                 if (err)
>>>                         goto out_del_balloon_wq;
>>>         }
>>> +
>>> +#ifdef CONFIG_PAGE_HINTING
>>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
>>> +           page_hinting_flag) {
>>> +               hvb = vb;
>>> +               page_hinting_enable(&hcb);
>>> +       }
>>> +#endif
>>>         virtio_device_ready(vdev);
>>>
>>>         if (towards_target(vb))
>>> @@ -989,6 +1091,12 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>>                 destroy_workqueue(vb->balloon_wq);
>>>         }
>>>
>>> +#ifdef CONFIG_PAGE_HINTING
>>> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> Nitesh, you should only disable if you actually enabled it
> (page_hinting_flag).
+1, thanks.
>
>
-- 
Regards
Nitesh



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-03 19:04   ` Alexander Duyck
@ 2019-06-04 12:55     ` Nitesh Narayan Lal
  2019-06-04 15:14       ` Alexander Duyck
  0 siblings, 1 reply; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 12:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli



On 6/3/19 3:04 PM, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page hinting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_FREE), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The number of pages removed
>> temporarily from the buddy is governed by the backend (virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can be hinted to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> There are still various things to look into (e.g., memory hotplug, more
>> efficient locking, possible races when disabling).
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> So one thing I had thought about, that I don't believe has been
> addressed in your solution, is a means to guarantee forward progress.
> If you have a noisy thread that is allocating and freeing some block
> of memory repeatedly, you will be stuck processing that and cannot get
> to the other work. Specifically, if you have a zone where somebody is
> just cycling the number of pages needed to fill your hinting queue,
> how do you get around it and get to the data that is actually cold
> instead of getting stuck processing the noise?

It should not matter. Every time the memory threshold is met, the
entire bitmap is scanned for possible isolation, not just a chunk of
memory. This guarantees forward progress.

> Do you have any idea what the hit rate would be on a system that is on
> the more active side? From what I can tell, you are still effectively
> just doing a linear search of memory, but you have the bitmap hints to
> tell what has not been freed recently; however, you still don't know
> that the pages you have bitmap hints for are actually free until you
> check them.
>
>> ---
>>  drivers/virtio/Kconfig       |   1 +
>>  include/linux/page_hinting.h |  46 +++++++
>>  mm/Kconfig                   |   6 +
>>  mm/Makefile                  |   2 +
>>  mm/page_alloc.c              |  17 +--
>>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>>  6 files changed, 301 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_hinting.h
>>  create mode 100644 mm/page_hinting.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 35897649c24f..5a96b7a2ed1e 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>>         tristate "Virtio balloon driver"
>>         depends on VIRTIO
>>         select MEMORY_BALLOON
>> +       select PAGE_HINTING
>>         ---help---
>>          This driver supports increasing and decreasing the amount
>>          of memory within a KVM guest.
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..e65188fe1e6b
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_HINTING_H
>> +#define _LINUX_PAGE_HINTING_H
>> +
>> +/*
>> + * Minimum page order required for a page to be hinted to the host.
>> + */
>> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
>> +
>> +/*
>> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
>> + * isolated pages.
>> + * @prepare:           Callback responsible for allocating an array to hold
>> + *                     the isolated pages.
>> + * @hint_pages:                Callback which reports the isolated pages synchronously
>> + *                     to the host.
>> + * @cleanup:           Callback to free the the array used for reporting the
>> + *                     isolated pages.
>> + * @max_pages:         Maximum pages that are going to be hinted to the host
>> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
>> + */
>> +struct page_hinting_cb {
>> +       int (*prepare)(void);
>> +       void (*hint_pages)(struct list_head *list);
>> +       void (*cleanup)(void);
>> +       int max_pages;
>> +};
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +void page_hinting_enqueue(struct page *page, int order);
>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>> +void page_hinting_disable(void);
>> +#else
>> +static inline void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
>> +{
>> +}
>> +
>> +static inline void page_hinting_disable(void)
>> +{
>> +}
>> +#endif
>> +#endif /* _LINUX_PAGE_HINTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ee8d1f311858..177d858de758 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>>  config ARCH_HAS_PTE_SPECIAL
>>         bool
>>
>> +# PAGE_HINTING will allow the guest to report the free pages to the
>> +# host in regular interval of time.
>> +config PAGE_HINTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ac5e5ba78874..bec456dfee34 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>>                            interval_tree.o list_lru.o workingset.o \
>>                            debug.o $(mmu-y)
>>
>> +
>>  # Give 'page_alloc' its own module-parameter namespace
>>  page-alloc-y := page_alloc.o
>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>  obj-$(CONFIG_CMA)      += cma.o
>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3b13d3914176..d12f69e0e402 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_hinting.h>
>>
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>   * -- nyc
>>   */
>>
>> -static inline void __free_one_page(struct page *page,
>> +inline void __free_one_page(struct page *page,
>>                 unsigned long pfn,
>>                 struct zone *zone, unsigned int order,
>> -               int migratetype)
>> +               int migratetype, bool hint)
>>  {
>>         unsigned long combined_pfn;
>>         unsigned long uninitialized_var(buddy_pfn);
>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>>  done_merging:
>>         set_page_order(page, order);
>>
>> +       if (hint)
>> +               page_hinting_enqueue(page, order);
> This is a bit early to probably be dealing with the hint. You should
> probably look at moving this down to a spot somewhere after the page
> has been added to the free list. It may not cause any issues with the
> current order setup, but moving after the addition to the free list
> will make it so that you know it is in there when you call this
> function.
I will take a look at this.
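If I understand the suggestion, it would be something along these lines
(untested sketch; assuming the add_to_free_area() helpers present in the
tree this series is based on):

done_merging:
	set_page_order(page, order);
	...
	add_to_free_area(page, &zone->free_area[order], migratetype);

	/* the page is on the free list at this point, safe to hint */
	if (hint)
		page_hinting_enqueue(page, order);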
>
>>         /*
>>          * If this is not the largest possible page, check if the buddy
>>          * of the next-highest order is free. If it is, it's possible
>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>                 if (unlikely(isolated_pageblocks))
>>                         mt = get_pageblock_migratetype(page);
>>
>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>         }
>>         spin_unlock(&zone->lock);
>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>                                 struct page *page, unsigned long pfn,
>>                                 unsigned int order,
>> -                               int migratetype)
>> +                               int migratetype, bool hint)
>>  {
>>         spin_lock(&zone->lock);
>>         if (unlikely(has_isolate_pageblock(zone) ||
>>                 is_migrate_isolate(migratetype))) {
>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>         }
>> -       __free_one_page(page, pfn, zone, order, migratetype);
>> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>>         spin_unlock(&zone->lock);
>>  }
>>
>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>         local_irq_save(flags);
>>         __count_vm_events(PGFREE, 1 << order);
>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>         local_irq_restore(flags);
>>  }
>>
>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>          */
>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>                         return;
>>                 }
>>                 migratetype = MIGRATE_MOVABLE;
> So it looks like you are using a parameter to identify if the page is
> a hinted page or not. I guess this works but it seems like it is a bit
> intrusive as you are adding an argument to specify that this is a
> specific page type.
Any suggestions on how we could do this in a less intrusive manner?
>
>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>> new file mode 100644
>> index 000000000000..7341c6462de2
>> --- /dev/null
>> +++ b/mm/page_hinting.c
>> @@ -0,0 +1,236 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page hinting support to enable a VM to report the freed pages back
>> + * to the host.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_hinting.h>
>> +#include <linux/kvm_host.h>
>> +
>> +/*
>> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
>> + * and other required parameters which could help in retrieving the original
>> + * PFN value using the bitmap.
>> + * @bitmap:            Pointer to the bitmap of free PFN.
>> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
>> + * @free_pages:                Tracks the number of free pages of granularity
>> + *                     PAGE_HINTING_MIN_ORDER.
>> + * @nbits:             Indicates the total size of the bitmap in bits allocated
>> + *                     at the time of initialization.
>> + */
>> +struct hinting_bitmap {
>> +       unsigned long *bitmap;
>> +       unsigned long base_pfn;
>> +       atomic_t free_pages;
>> +       unsigned long nbits;
>> +} bm_zone[MAX_NR_ZONES];
>> +
> This ignores NUMA doesn't it? Shouldn't you have support for other NUMA nodes?
I will have to look into this.
>
>> +static void init_hinting_wq(struct work_struct *work);
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>> +                           struct zone *zone, unsigned int order,
>> +                           int migratetype, bool hint);
>> +const struct page_hinting_cb *hcb;
>> +struct work_struct hinting_work;
>> +
>> +static unsigned long find_bitmap_size(struct zone *zone)
>> +{
>> +       unsigned long nbits = ALIGN(zone->spanned_pages,
>> +                           PAGE_HINTING_MIN_ORDER);
>> +
>> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>> +       return nbits;
>> +}
>> +
> This doesn't look right to me. You are trying to do something like a
> DIV_ROUND_UP here, right? If so shouldn't you be aligning to 1 <<
> PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
> Another option would be to just do DIV_ROUND_UP with the 1 <<
> PAGE_HINTING_MIN_ORDER value.
I will double-check this.
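Probably something like this (untested sketch):

static unsigned long find_bitmap_size(struct zone *zone)
{
	/*
	 * One bit per chunk of (1 << PAGE_HINTING_MIN_ORDER) pages,
	 * rounding the last partial chunk up.
	 */
	return DIV_ROUND_UP(zone->spanned_pages,
			    1UL << PAGE_HINTING_MIN_ORDER);
}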
>
>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>> +{
>> +       struct zone *zone;
>> +       int idx = 0;
>> +       unsigned long bitmap_size = 0;
>> +
>> +       for_each_populated_zone(zone) {
> The index for this doesn't match up to the index you used to define
> bm_zone. for_each_populated_zone will go through each zone in each
> pgdat. Right now you can only handle one pgdat.
Not sure if I understood this entirely. Can you please explain this further?
>
>> +               spin_lock(&zone->lock);
>> +               bitmap_size = find_bitmap_size(zone);
>> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +               if (!bm_zone[idx].bitmap)
>> +                       return;
>> +               bm_zone[idx].nbits = bitmap_size;
>> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
>> +               spin_unlock(&zone->lock);
>> +               idx++;
>> +       }
>> +       hcb = callback;
>> +       INIT_WORK(&hinting_work, init_hinting_wq);
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>> +
>> +void page_hinting_disable(void)
>> +{
>> +       struct zone *zone;
>> +       int idx = 0;
>> +
>> +       cancel_work_sync(&hinting_work);
>> +       hcb = NULL;
>> +       for_each_populated_zone(zone) {
>> +               spin_lock(&zone->lock);
>> +               bitmap_free(bm_zone[idx].bitmap);
>> +               bm_zone[idx].base_pfn = 0;
>> +               bm_zone[idx].nbits = 0;
>> +               atomic_set(&bm_zone[idx].free_pages, 0);
>> +               spin_unlock(&zone->lock);
>> +               idx++;
>> +       }
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>> +
>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>> +{
>> +       unsigned long bitnr;
>> +
>> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>> +                        >> PAGE_HINTING_MIN_ORDER;
>> +       return bitnr;
>> +}
>> +
>> +static void release_buddy_pages(struct list_head *pages)
>> +{
>> +       int mt = 0, zonenum, order;
>> +       struct page *page, *next;
>> +       struct zone *zone;
>> +       unsigned long bitnr;
>> +
>> +       list_for_each_entry_safe(page, next, pages, lru) {
>> +               zonenum = page_zonenum(page);
>> +               zone = page_zone(page);
>> +               bitnr = pfn_to_bit(page, zonenum);
>> +               spin_lock(&zone->lock);
>> +               list_del(&page->lru);
>> +               order = page_private(page);
>> +               set_page_private(page, 0);
>> +               mt = get_pageblock_migratetype(page);
>> +               __free_one_page(page, page_to_pfn(page), zone,
>> +                               order, mt, false);
>> +               spin_unlock(&zone->lock);
>> +       }
>> +}
>> +
>> +static void bm_set_pfn(struct page *page)
>> +{
>> +       unsigned long bitnr = 0;
>> +       int zonenum = page_zonenum(page);
>> +       struct zone *zone = page_zone(page);
>> +
>> +       lockdep_assert_held(&zone->lock);
>> +       bitnr = pfn_to_bit(page, zonenum);
>> +       if (bm_zone[zonenum].bitmap &&
>> +           bitnr < bm_zone[zonenum].nbits &&
>> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>> +               atomic_inc(&bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>> +{
>> +       unsigned long set_bit, start = 0;
>> +       struct page *page;
>> +       struct zone *zone;
>> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>> +       LIST_HEAD(isolated_pages);
>> +
>> +       ret = hcb->prepare();
>> +       if (ret < 0)
>> +               return;
>> +       for (;;) {
>> +               ret = 0;
>> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>> +                                       bm_zone[zonenum].nbits, start);
>> +               if (set_bit >= bm_zone[zonenum].nbits)
>> +                       break;
>> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>> +                               bm_zone[zonenum].base_pfn);
>> +               if (!page)
>> +                       continue;
>> +               zone = page_zone(page);
>> +               spin_lock(&zone->lock);
>> +
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_HINTING_MIN_ORDER) {
>> +                       order = page_private(page);
>> +                       ret = __isolate_free_page(page, order);
>> +               }
>> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
>> +               spin_unlock(&zone->lock);
>> +               if (ret) {
>> +                       /*
>> +                        * restoring page order to use it while releasing
>> +                        * the pages back to the buddy.
>> +                        */
>> +                       set_page_private(page, order);
>> +                       list_add_tail(&page->lru, &isolated_pages);
>> +                       isolated_cnt++;
>> +                       if (isolated_cnt == hcb->max_pages) {
>> +                               hcb->hint_pages(&isolated_pages);
>> +                               release_buddy_pages(&isolated_pages);
>> +                               isolated_cnt = 0;
>> +                       }
>> +               }
>> +               start = set_bit + 1;
>> +               scanned_pages++;
>> +       }
>> +       if (isolated_cnt) {
>> +               hcb->hint_pages(&isolated_pages);
>> +               release_buddy_pages(&isolated_pages);
>> +       }
>> +       hcb->cleanup();
>> +       if (scanned_pages > free_pages)
>> +               atomic_sub((scanned_pages - free_pages),
>> +                          &bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static bool check_hinting_threshold(void)
>> +{
>> +       int zonenum = 0;
>> +
>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
>> +                               hcb->max_pages)
>> +                       return true;
>> +       }
>> +       return false;
>> +}
>> +
>> +static void init_hinting_wq(struct work_struct *work)
>> +{
>> +       int zonenum = 0, free_pages = 0;
>> +
>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>> +               if (free_pages >= hcb->max_pages) {
>> +                       /* Find a better way to synchronize per zone
>> +                        * free_pages.
>> +                        */
>> +                       atomic_sub(free_pages,
>> +                                  &bm_zone[zonenum].free_pages);
>> +                       scan_hinting_bitmap(zonenum, free_pages);
>> +               }
>> +       }
>> +}
>> +
>> +void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>> +               bm_set_pfn(page);
>> +       else
>> +               return;
> You could probably flip the logic and save yourself an "else" by just
> doing something like:
> if (!hcb || order < PAGE_HINTING_MIN_ORDER)
>         return;
>
> I think it would also make this more readable.
>
+1
>> +
>> +       if (check_hinting_threshold()) {
>> +               int cpu = smp_processor_id();
>> +
>> +               queue_work_on(cpu, system_wq, &hinting_work);
>> +       }
>> +}
>> --
>> 2.21.0
>>
-- 
Regards
Nitesh



* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-03 19:57   ` David Hildenbrand
@ 2019-06-04 13:16     ` Nitesh Narayan Lal
  0 siblings, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 13:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck



On 6/3/19 3:57 PM, David Hildenbrand wrote:
> On 03.06.19 19:03, Nitesh Narayan Lal wrote:
>> This patch introduces the core infrastructure for free page hinting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_FREE), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend (virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can be hinted to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> There are still various things to look into (e.g., memory hotplug, more
>> efficient locking, possible races when disabling).
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/Kconfig       |   1 +
>>  include/linux/page_hinting.h |  46 +++++++
>>  mm/Kconfig                   |   6 +
>>  mm/Makefile                  |   2 +
>>  mm/page_alloc.c              |  17 +--
>>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>>  6 files changed, 301 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_hinting.h
>>  create mode 100644 mm/page_hinting.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 35897649c24f..5a96b7a2ed1e 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>>  	tristate "Virtio balloon driver"
>>  	depends on VIRTIO
>>  	select MEMORY_BALLOON
>> +	select PAGE_HINTING
>>  	---help---
>>  	 This driver supports increasing and decreasing the amount
>>  	 of memory within a KVM guest.
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..e65188fe1e6b
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_HINTING_H
>> +#define _LINUX_PAGE_HINTING_H
>> +
>> +/*
>> + * Minimum page order required for a page to be hinted to the host.
>> + */
>> +#define PAGE_HINTING_MIN_ORDER		(MAX_ORDER - 2)
>> +
>> +/*
>> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
>> + * isolated pages.
>> + * @prepare:		Callback responsible for allocating an array to hold
>> + *			the isolated pages.
>> + * @hint_pages:		Callback which reports the isolated pages synchronously
>> + *			to the host.
>> + * @cleanup:		Callback to free the array used for reporting the
>> + *			isolated pages.
>> + * @max_pages:		Maximum pages that are going to be hinted to the host
>> + *			at a time of granularity >= PAGE_HINTING_MIN_ORDER.
>> + */
>> +struct page_hinting_cb {
>> +	int (*prepare)(void);
>> +	void (*hint_pages)(struct list_head *list);
>> +	void (*cleanup)(void);
>> +	int max_pages;
> If we allocate the array in virtio-balloon differently (e.g. similar to
> bulk inflation/deflation of pfn's right now), we can most probably get
> rid of prepare() and cleanup(), simplifying the code further.
Yeap, as Alexander suggested. I will go for static allocation and remove
this prepare() and cleanup().
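A rough sketch of the direction (the size and field name below are
placeholders, not from this series):

#define VIRTIO_BALLOON_MAX_HINT_PAGES	16

struct virtio_balloon {
	...
	/* statically sized hint buffer, no prepare()/cleanup() needed */
	__virtio64 hint_pfns[VIRTIO_BALLOON_MAX_HINT_PAGES];
};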
>> +};
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +void page_hinting_enqueue(struct page *page, int order);
>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>> +void page_hinting_disable(void);
>> +#else
>> +static inline void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
>> +{
>> +}
>> +
>> +static inline void page_hinting_disable(void)
>> +{
>> +}
>> +#endif
>> +#endif /* _LINUX_PAGE_HINTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ee8d1f311858..177d858de758 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>>  config ARCH_HAS_PTE_SPECIAL
>>  	bool
>>  
>> +# PAGE_HINTING will allow the guest to report the free pages to the
>> +# host at regular intervals.
>> +config PAGE_HINTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ac5e5ba78874..bec456dfee34 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -41,6 +41,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>>  			   interval_tree.o list_lru.o workingset.o \
>>  			   debug.o $(mmu-y)
>>  
>> +
>>  # Give 'page_alloc' its own module-parameter namespace
>>  page-alloc-y := page_alloc.o
>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)	+= z3fold.o
>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>  obj-$(CONFIG_CMA)	+= cma.o
>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3b13d3914176..d12f69e0e402 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_hinting.h>
>>  
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>   * -- nyc
>>   */
>>  
>> -static inline void __free_one_page(struct page *page,
>> +inline void __free_one_page(struct page *page,
>>  		unsigned long pfn,
>>  		struct zone *zone, unsigned int order,
>> -		int migratetype)
>> +		int migratetype, bool hint)
>>  {
>>  	unsigned long combined_pfn;
>>  	unsigned long uninitialized_var(buddy_pfn);
>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>>  done_merging:
>>  	set_page_order(page, order);
>>  
>> +	if (hint)
>> +		page_hinting_enqueue(page, order);
>>  	/*
>>  	 * If this is not the largest possible page, check if the buddy
>>  	 * of the next-highest order is free. If it is, it's possible
>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  		if (unlikely(isolated_pageblocks))
>>  			mt = get_pageblock_migratetype(page);
>>  
>> -		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>  		trace_mm_page_pcpu_drain(page, 0, mt);
>>  	}
>>  	spin_unlock(&zone->lock);
>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>  				struct page *page, unsigned long pfn,
>>  				unsigned int order,
>> -				int migratetype)
>> +				int migratetype, bool hint)
>>  {
>>  	spin_lock(&zone->lock);
>>  	if (unlikely(has_isolate_pageblock(zone) ||
>>  		is_migrate_isolate(migratetype))) {
>>  		migratetype = get_pfnblock_migratetype(page, pfn);
>>  	}
>> -	__free_one_page(page, pfn, zone, order, migratetype);
>> +	__free_one_page(page, pfn, zone, order, migratetype, hint);
>>  	spin_unlock(&zone->lock);
>>  }
>>  
>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>  	migratetype = get_pfnblock_migratetype(page, pfn);
>>  	local_irq_save(flags);
>>  	__count_vm_events(PGFREE, 1 << order);
>> -	free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>  	local_irq_restore(flags);
>>  }
>>  
>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>  	 */
>>  	if (migratetype >= MIGRATE_PCPTYPES) {
>>  		if (unlikely(is_migrate_isolate(migratetype))) {
>> -			free_one_page(zone, page, pfn, 0, migratetype);
>> +			free_one_page(zone, page, pfn, 0, migratetype, true);
>>  			return;
>>  		}
>>  		migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>> new file mode 100644
>> index 000000000000..7341c6462de2
>> --- /dev/null
>> +++ b/mm/page_hinting.c
>> @@ -0,0 +1,236 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page hinting support to enable a VM to report the freed pages back
>> + * to the host.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_hinting.h>
>> +#include <linux/kvm_host.h>
>> +
>> +/*
>> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
>> + * and other required parameters which could help in retrieving the original
>> + * PFN value using the bitmap.
>> + * @bitmap:		Pointer to the bitmap of free PFN.
>> + * @base_pfn:		Starting PFN value for the zone whose bitmap is stored.
>> + * @free_pages:		Tracks the number of free pages of granularity
>> + *			PAGE_HINTING_MIN_ORDER.
>> + * @nbits:		Indicates the total size of the bitmap in bits allocated
>> + *			at the time of initialization.
>> + */
>> +struct hinting_bitmap {
>> +	unsigned long *bitmap;
>> +	unsigned long base_pfn;
>> +	atomic_t free_pages;
>> +	unsigned long nbits;
>> +} bm_zone[MAX_NR_ZONES];
>> +
>> +static void init_hinting_wq(struct work_struct *work);
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>> +			    struct zone *zone, unsigned int order,
>> +			    int migratetype, bool hint);
>> +const struct page_hinting_cb *hcb;
>> +struct work_struct hinting_work;
>> +
>> +static unsigned long find_bitmap_size(struct zone *zone)
>> +{
>> +	unsigned long nbits = ALIGN(zone->spanned_pages,
>> +			    PAGE_HINTING_MIN_ORDER);
>> +
>> +	nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>> +	return nbits;
> I think we can simplify this to
>
> return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1;
>
I will check this.
>> +}
>> +
>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>> +{
>> +	struct zone *zone;
>> +	int idx = 0;
>> +	unsigned long bitmap_size = 0;
> You should probably protect enabling via a mutex and return -EINVAL or
> similar if we already have a callback set (if we ever have different
> drivers). But this has very little priority :)
I will have to look into this.
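Something along these lines, maybe (untested sketch; it would also mean
returning an error instead of void):

static DEFINE_MUTEX(hinting_mutex);

int page_hinting_enable(const struct page_hinting_cb *callback)
{
	mutex_lock(&hinting_mutex);
	if (hcb) {
		mutex_unlock(&hinting_mutex);
		return -EBUSY;
	}
	/* ... allocate the bitmaps and set hcb ... */
	mutex_unlock(&hinting_mutex);
	return 0;
}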
>> +
>> +	for_each_populated_zone(zone) {
>> +		spin_lock(&zone->lock);
>> +		bitmap_size = find_bitmap_size(zone);
>> +		bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +		if (!bm_zone[idx].bitmap)
>> +			return;
>> +		bm_zone[idx].nbits = bitmap_size;
>> +		bm_zone[idx].base_pfn = zone->zone_start_pfn;
>> +		spin_unlock(&zone->lock);
>> +		idx++;
>> +	}
>> +	hcb = callback;
>> +	INIT_WORK(&hinting_work, init_hinting_wq);
> There are also possible races when enabling, you will have to take care
> of at one point.
No page will be enqueued until hcb is set.
I can probably move it below INIT_WORK.
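Something like this, perhaps (untested sketch; whether a barrier is even
needed here would need more thought):

	INIT_WORK(&hinting_work, init_hinting_wq);
	/*
	 * Make sure the work item is fully initialized before
	 * page_hinting_enqueue() can observe a non-NULL hcb.
	 */
	smp_wmb();
	hcb = callback;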
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>> +
>> +void page_hinting_disable(void)
>> +{
>> +	struct zone *zone;
>> +	int idx = 0;
>> +
>> +	cancel_work_sync(&hinting_work);
>> +	hcb = NULL;
>> +	for_each_populated_zone(zone) {
>> +		spin_lock(&zone->lock);
>> +		bitmap_free(bm_zone[idx].bitmap);
>> +		bm_zone[idx].base_pfn = 0;
>> +		bm_zone[idx].nbits = 0;
>> +		atomic_set(&bm_zone[idx].free_pages, 0);
>> +		spin_unlock(&zone->lock);
>> +		idx++;
>> +	}
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>> +
>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>> +{
>> +	unsigned long bitnr;
>> +
>> +	bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>> +			 >> PAGE_HINTING_MIN_ORDER;
>> +	return bitnr;
>> +}
>> +
>> +static void release_buddy_pages(struct list_head *pages)
> maybe "release_isolated_pages", not sure.
>
>> +{
>> +	int mt = 0, zonenum, order;
>> +	struct page *page, *next;
>> +	struct zone *zone;
>> +	unsigned long bitnr;
>> +
>> +	list_for_each_entry_safe(page, next, pages, lru) {
>> +		zonenum = page_zonenum(page);
>> +		zone = page_zone(page);
>> +		bitnr = pfn_to_bit(page, zonenum);
>> +		spin_lock(&zone->lock);
>> +		list_del(&page->lru);
>> +		order = page_private(page);
>> +		set_page_private(page, 0);
>> +		mt = get_pageblock_migratetype(page);
>> +		__free_one_page(page, page_to_pfn(page), zone,
>> +				order, mt, false);
>> +		spin_unlock(&zone->lock);
>> +	}
>> +}
>> +
>> +static void bm_set_pfn(struct page *page)
>> +{
>> +	unsigned long bitnr = 0;
>> +	int zonenum = page_zonenum(page);
>> +	struct zone *zone = page_zone(page);
>> +
>> +	lockdep_assert_held(&zone->lock);
>> +	bitnr = pfn_to_bit(page, zonenum);
>> +	if (bm_zone[zonenum].bitmap &&
>> +	    bitnr < bm_zone[zonenum].nbits &&
>> +	    !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>> +		atomic_inc(&bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>> +{
>> +	unsigned long set_bit, start = 0;
>> +	struct page *page;
>> +	struct zone *zone;
>> +	int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>> +	LIST_HEAD(isolated_pages);
>> +
>> +	ret = hcb->prepare();
>> +	if (ret < 0)
>> +		return;
>> +	for (;;) {
>> +		ret = 0;
>> +		set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>> +					bm_zone[zonenum].nbits, start);
>> +		if (set_bit >= bm_zone[zonenum].nbits)
>> +			break;
>> +		page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>> +				bm_zone[zonenum].base_pfn);
>> +		if (!page)
>> +			continue;
> You are not clearing the bit / decrementing the counter.
>
>> +		zone = page_zone(page);
>> +		spin_lock(&zone->lock);
>> +
>> +		if (PageBuddy(page) && page_private(page) >=
>> +		    PAGE_HINTING_MIN_ORDER) {
>> +			order = page_private(page);
>> +			ret = __isolate_free_page(page, order);
>> +		}
>> +		clear_bit(set_bit, bm_zone[zonenum].bitmap);
>> +		spin_unlock(&zone->lock);
>> +		if (ret) {
>> +			/*
>> +			 * restoring page order to use it while releasing
>> +			 * the pages back to the buddy.
>> +			 */
>> +			set_page_private(page, order);
>> +			list_add_tail(&page->lru, &isolated_pages);
>> +			isolated_cnt++;
>> +			if (isolated_cnt == hcb->max_pages) {
>> +				hcb->hint_pages(&isolated_pages);
>> +				release_buddy_pages(&isolated_pages);
>> +				isolated_cnt = 0;
>> +			}
>> +		}
>> +		start = set_bit + 1;
>> +		scanned_pages++;
>> +	}
>> +	if (isolated_cnt) {
>> +		hcb->hint_pages(&isolated_pages);
>> +		release_buddy_pages(&isolated_pages);
>> +	}
>> +	hcb->cleanup();
>> +	if (scanned_pages > free_pages)
>> +		atomic_sub((scanned_pages - free_pages),
>> +			   &bm_zone[zonenum].free_pages);
> This looks overly complicated. Can't we somehow simply decrement when
> clearing a bit?
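Possibly something like this wherever a bit is cleared (untested sketch):

	if (test_and_clear_bit(set_bit, bm_zone[zonenum].bitmap))
		atomic_dec(&bm_zone[zonenum].free_pages);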
>
>> +}
>> +
>> +static bool check_hinting_threshold(void)
>> +{
>> +	int zonenum = 0;
>> +
>> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +		if (atomic_read(&bm_zone[zonenum].free_pages) >=
>> +				hcb->max_pages)
>> +			return true;
>> +	}
>> +	return false;
>> +}
>> +
>> +static void init_hinting_wq(struct work_struct *work)
>> +{
>> +	int zonenum = 0, free_pages = 0;
>> +
>> +	for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +		free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>> +		if (free_pages >= hcb->max_pages) {
>> +			/* Find a better way to synchronize per zone
>> +			 * free_pages.
>> +			 */
>> +			atomic_sub(free_pages,
>> +				   &bm_zone[zonenum].free_pages);
> I can't follow yet why we need that information. 
We don't want to enqueue multiple jobs just because we are delaying the
decrementing of free_pages.
I agree; even I am not convinced by the approach I have right now.
I will try to come up with a better way.
> Wouldn't it be enough
> to just track the number of set bits in the bitmap and start hinting
> depending on that count? (there are false positives, but do we really care?)
In an attempt to minimize the false positives, I might have overly
complicated this part.
I will try to simplify this in the next posting.
>

>> +			scan_hinting_bitmap(zonenum, free_pages);
>> +		}
>> +	}
>> +}
>> +
>> +void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +	if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>> +		bm_set_pfn(page);
>> +	else
>> +		return;
>> +
>> +	if (check_hinting_threshold()) {
>> +		int cpu = smp_processor_id();
>> +
>> +		queue_work_on(cpu, system_wq, &hinting_work);
>> +	}
>> +}
>>
>
-- 
Regards
Nitesh



* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-04 12:55     ` Nitesh Narayan Lal
@ 2019-06-04 15:14       ` Alexander Duyck
  2019-06-04 16:07         ` Nitesh Narayan Lal
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2019-06-04 15:14 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/3/19 3:04 PM, Alexander Duyck wrote:
> > On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch introduces the core infrastructure for free page hinting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_FREE), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend (virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can be hinted to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> There are still various things to look into (e.g., memory hotplug, more
> >> efficient locking, possible races when disabling).
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > So one thing I had thought about, that I don't believe that has been
> > addressed in your solution, is to determine a means to guarantee
> > forward progress. If you have a noisy thread that is allocating and
> > freeing some block of memory repeatedly you will be stuck processing
> > that and cannot get to the other work. Specifically if you have a zone
> > where somebody is just cycling the number of pages needed to fill your
> > hinting queue how do you get around it and get to the data that is
> > actually cold instead of getting stuck processing the noise?
>
> It should not matter. Every time the memory threshold is met, the
> entire bitmap is scanned for possible isolation, not just a chunk of
> memory. This will guarantee forward progress.

So I think there may still be some issues. I see how you go from the
start to the end, but how do you loop back to the start again as pages
are added? The init_hinting_wq doesn't seem to have a way to get back
to the start again if there is still work to do after you have
completed your pass without queue_work_on firing off another thread.

> > Do you have any idea what the hit rate would be on a system that is on
> > the more active side? From what I can tell you still are effectively
> > just doing a linear search of memory, but you have the bitmap hints to
> > tell what has not been freed recently, however you still don't know
> > that the pages you have bitmap hints for are actually free until you
> > check them.
> >
> >> ---
> >>  drivers/virtio/Kconfig       |   1 +
> >>  include/linux/page_hinting.h |  46 +++++++
> >>  mm/Kconfig                   |   6 +
> >>  mm/Makefile                  |   2 +
> >>  mm/page_alloc.c              |  17 +--
> >>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
> >>  6 files changed, 301 insertions(+), 7 deletions(-)
> >>  create mode 100644 include/linux/page_hinting.h
> >>  create mode 100644 mm/page_hinting.c
> >>
> >> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> >> index 35897649c24f..5a96b7a2ed1e 100644
> >> --- a/drivers/virtio/Kconfig
> >> +++ b/drivers/virtio/Kconfig
> >> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
> >>         tristate "Virtio balloon driver"
> >>         depends on VIRTIO
> >>         select MEMORY_BALLOON
> >> +       select PAGE_HINTING
> >>         ---help---
> >>          This driver supports increasing and decreasing the amount
> >>          of memory within a KVM guest.
> >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> >> new file mode 100644
> >> index 000000000000..e65188fe1e6b
> >> --- /dev/null
> >> +++ b/include/linux/page_hinting.h
> >> @@ -0,0 +1,46 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#ifndef _LINUX_PAGE_HINTING_H
> >> +#define _LINUX_PAGE_HINTING_H
> >> +
> >> +/*
> >> + * Minimum page order required for a page to be hinted to the host.
> >> + */
> >> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
> >> +
> >> +/*
> >> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
> >> + * isolated pages.
> >> + * @prepare:           Callback responsible for allocating an array to hold
> >> + *                     the isolated pages.
> >> + * @hint_pages:                Callback which reports the isolated pages synchronously
> >> + *                     to the host.
> >> + * @cleanup:           Callback to free the array used for reporting the
> >> + *                     isolated pages.
> >> + * @max_pages:         Maximum pages that are going to be hinted to the host
> >> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
> >> + */
> >> +struct page_hinting_cb {
> >> +       int (*prepare)(void);
> >> +       void (*hint_pages)(struct list_head *list);
> >> +       void (*cleanup)(void);
> >> +       int max_pages;
> >> +};
> >> +
> >> +#ifdef CONFIG_PAGE_HINTING
> >> +void page_hinting_enqueue(struct page *page, int order);
> >> +void page_hinting_enable(const struct page_hinting_cb *cb);
> >> +void page_hinting_disable(void);
> >> +#else
> >> +static inline void page_hinting_enqueue(struct page *page, int order)
> >> +{
> >> +}
> >> +
> >> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
> >> +{
> >> +}
> >> +
> >> +static inline void page_hinting_disable(void)
> >> +{
> >> +}
> >> +#endif
> >> +#endif /* _LINUX_PAGE_HINTING_H */
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index ee8d1f311858..177d858de758 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
> >>  config ARCH_HAS_PTE_SPECIAL
> >>         bool
> >>
> >> +# PAGE_HINTING will allow the guest to report the free pages to the
> >> +# host at regular intervals.
> >> +config PAGE_HINTING
> >> +       bool
> >> +       def_bool n
> >> +       depends on X86_64
> >>  endmenu
> >> diff --git a/mm/Makefile b/mm/Makefile
> >> index ac5e5ba78874..bec456dfee34 100644
> >> --- a/mm/Makefile
> >> +++ b/mm/Makefile
> >> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
> >>                            interval_tree.o list_lru.o workingset.o \
> >>                            debug.o $(mmu-y)
> >>
> >> +
> >>  # Give 'page_alloc' its own module-parameter namespace
> >>  page-alloc-y := page_alloc.o
> >>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
> >> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
> >>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
> >>  obj-$(CONFIG_CMA)      += cma.o
> >>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> >> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
> >>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 3b13d3914176..d12f69e0e402 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -68,6 +68,7 @@
> >>  #include <linux/lockdep.h>
> >>  #include <linux/nmi.h>
> >>  #include <linux/psi.h>
> >> +#include <linux/page_hinting.h>
> >>
> >>  #include <asm/sections.h>
> >>  #include <asm/tlbflush.h>
> >> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
> >>   * -- nyc
> >>   */
> >>
> >> -static inline void __free_one_page(struct page *page,
> >> +inline void __free_one_page(struct page *page,
> >>                 unsigned long pfn,
> >>                 struct zone *zone, unsigned int order,
> >> -               int migratetype)
> >> +               int migratetype, bool hint)
> >>  {
> >>         unsigned long combined_pfn;
> >>         unsigned long uninitialized_var(buddy_pfn);
> >> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
> >>  done_merging:
> >>         set_page_order(page, order);
> >>
> >> +       if (hint)
> >> +               page_hinting_enqueue(page, order);
> > This is a bit early to probably be dealing with the hint. You should
> > probably look at moving this down to a spot somewhere after the page
> > has been added to the free list. It may not cause any issues with the
> > current order setup, but moving after the addition to the free list
> > will make it so that you know it is in there when you call this
> > function.
> I will take a look at this.
> >
> >>         /*
> >>          * If this is not the largest possible page, check if the buddy
> >>          * of the next-highest order is free. If it is, it's possible
> >> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >>                 if (unlikely(isolated_pageblocks))
> >>                         mt = get_pageblock_migratetype(page);
> >>
> >> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> >> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
> >>                 trace_mm_page_pcpu_drain(page, 0, mt);
> >>         }
> >>         spin_unlock(&zone->lock);
> >> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >>  static void free_one_page(struct zone *zone,
> >>                                 struct page *page, unsigned long pfn,
> >>                                 unsigned int order,
> >> -                               int migratetype)
> >> +                               int migratetype, bool hint)
> >>  {
> >>         spin_lock(&zone->lock);
> >>         if (unlikely(has_isolate_pageblock(zone) ||
> >>                 is_migrate_isolate(migratetype))) {
> >>                 migratetype = get_pfnblock_migratetype(page, pfn);
> >>         }
> >> -       __free_one_page(page, pfn, zone, order, migratetype);
> >> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
> >>         spin_unlock(&zone->lock);
> >>  }
> >>
> >> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >>         migratetype = get_pfnblock_migratetype(page, pfn);
> >>         local_irq_save(flags);
> >>         __count_vm_events(PGFREE, 1 << order);
> >> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> >> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
> >>         local_irq_restore(flags);
> >>  }
> >>
> >> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
> >>          */
> >>         if (migratetype >= MIGRATE_PCPTYPES) {
> >>                 if (unlikely(is_migrate_isolate(migratetype))) {
> >> -                       free_one_page(zone, page, pfn, 0, migratetype);
> >> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
> >>                         return;
> >>                 }
> >>                 migratetype = MIGRATE_MOVABLE;
> > So it looks like you are using a parameter to identify if the page is
> > a hinted page or not. I guess this works but it seems like it is a bit
> > intrusive as you are adding an argument to specify that this is a
> > specific page type.
> Any suggestions on how we could do this in a less intrusive manner?

The quick approach would be to add some piece of metadata somewhere in
the page that you could trigger off of. If you could do that then you
could drop the need for all these extra checks and instead just skip
the notification for those pages. I really don't think the addition of
the "Treated" flag was all that invasive, at least within the kernel. It
would allow you to avoid all the changes to free_one_page and
__free_one_page.
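For illustration, roughly this kind of check in the free path
("PageTreated" is hypothetical here, named after that earlier idea, not
an existing page flag):

	/* already hinted / being returned from hinting, don't notify again */
	if (PageTreated(page))
		return;

That would keep the decision local to the hinting code instead of
threading an extra argument through the free paths.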

> >
> >> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> >> new file mode 100644
> >> index 000000000000..7341c6462de2
> >> --- /dev/null
> >> +++ b/mm/page_hinting.c
> >> @@ -0,0 +1,236 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Page hinting support to enable a VM to report the freed pages back
> >> + * to the host.
> >> + *
> >> + * Copyright Red Hat, Inc. 2019
> >> + *
> >> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> >> + */
> >> +
> >> +#include <linux/mm.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/page_hinting.h>
> >> +#include <linux/kvm_host.h>
> >> +
> >> +/*
> >> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
> >> + * and other required parameters which could help in retrieving the original
> >> + * PFN value using the bitmap.
> >> + * @bitmap:            Pointer to the bitmap of free PFN.
> >> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
> >> + * @free_pages:                Tracks the number of free pages of granularity
> >> + *                     PAGE_HINTING_MIN_ORDER.
> >> + * @nbits:             Indicates the total size of the bitmap in bits allocated
> >> + *                     at the time of initialization.
> >> + */
> >> +struct hinting_bitmap {
> >> +       unsigned long *bitmap;
> >> +       unsigned long base_pfn;
> >> +       atomic_t free_pages;
> >> +       unsigned long nbits;
> >> +} bm_zone[MAX_NR_ZONES];
> >> +
> > This ignores NUMA doesn't it? Shouldn't you have support for other NUMA nodes?
> I will have to look into this.

So it doesn't cause a panic, but with 2 NUMA nodes you are only
hinting on half the memory. I was able to build, test, and verify
this. I had resolved it by simply multiplying MAX_NR_ZONES by
MAX_NUMNODES, and splitting my indices between node and zone.
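Roughly like this (the helper name is just for illustration):

	struct hinting_bitmap bm_zone[MAX_NUMNODES * MAX_NR_ZONES];

	static int bm_index(struct zone *zone)
	{
		/* split the index between NUMA node and zone */
		return zone_to_nid(zone) * MAX_NR_ZONES + zone_idx(zone);
	}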

> >
> >> +static void init_hinting_wq(struct work_struct *work);
> >> +extern int __isolate_free_page(struct page *page, unsigned int order);
> >> +extern void __free_one_page(struct page *page, unsigned long pfn,
> >> +                           struct zone *zone, unsigned int order,
> >> +                           int migratetype, bool hint);
> >> +const struct page_hinting_cb *hcb;
> >> +struct work_struct hinting_work;
> >> +
> >> +static unsigned long find_bitmap_size(struct zone *zone)
> >> +{
> >> +       unsigned long nbits = ALIGN(zone->spanned_pages,
> >> +                           PAGE_HINTING_MIN_ORDER);
> >> +
> >> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
> >> +       return nbits;
> >> +}
> >> +
> > This doesn't look right to me. You are trying to do something like a
> > DIV_ROUND_UP here, right? If so shouldn't you be aligning to 1 <<
> > PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
> > Another option would be to just do DIV_ROUND_UP with the 1 <<
> > PAGE_HINTING_MIN_ORDER value.
> I will double check this.
> >
> >> +void page_hinting_enable(const struct page_hinting_cb *callback)
> >> +{
> >> +       struct zone *zone;
> >> +       int idx = 0;
> >> +       unsigned long bitmap_size = 0;
> >> +
> >> +       for_each_populated_zone(zone) {
> > The index for this doesn't match up to the index you used to define
> > bm_zone. for_each_populated_zone will go through each zone in each
> > pgdat. Right now you can only handle one pgdat.
> Not sure if I understood this entirely. Can you please explain this further?
> >
> >> +               spin_lock(&zone->lock);
> >> +               bitmap_size = find_bitmap_size(zone);
> >> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> >> +               if (!bm_zone[idx].bitmap)
> >> +                       return;
> >> +               bm_zone[idx].nbits = bitmap_size;
> >> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
> >> +               spin_unlock(&zone->lock);
> >> +               idx++;
> >> +       }
> >> +       hcb = callback;
> >> +       INIT_WORK(&hinting_work, init_hinting_wq);
> >> +}
> >> +EXPORT_SYMBOL_GPL(page_hinting_enable);
> >> +
> >> +void page_hinting_disable(void)
> >> +{
> >> +       struct zone *zone;
> >> +       int idx = 0;
> >> +
> >> +       cancel_work_sync(&hinting_work);
> >> +       hcb = NULL;
> >> +       for_each_populated_zone(zone) {
> >> +               spin_lock(&zone->lock);
> >> +               bitmap_free(bm_zone[idx].bitmap);
> >> +               bm_zone[idx].base_pfn = 0;
> >> +               bm_zone[idx].nbits = 0;
> >> +               atomic_set(&bm_zone[idx].free_pages, 0);
> >> +               spin_unlock(&zone->lock);
> >> +               idx++;
> >> +       }
> >> +}
> >> +EXPORT_SYMBOL_GPL(page_hinting_disable);
> >> +
> >> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
> >> +{
> >> +       unsigned long bitnr;
> >> +
> >> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
> >> +                        >> PAGE_HINTING_MIN_ORDER;
> >> +       return bitnr;
> >> +}
> >> +
> >> +static void release_buddy_pages(struct list_head *pages)
> >> +{
> >> +       int mt = 0, zonenum, order;
> >> +       struct page *page, *next;
> >> +       struct zone *zone;
> >> +       unsigned long bitnr;
> >> +
> >> +       list_for_each_entry_safe(page, next, pages, lru) {
> >> +               zonenum = page_zonenum(page);
> >> +               zone = page_zone(page);
> >> +               bitnr = pfn_to_bit(page, zonenum);
> >> +               spin_lock(&zone->lock);
> >> +               list_del(&page->lru);
> >> +               order = page_private(page);
> >> +               set_page_private(page, 0);
> >> +               mt = get_pageblock_migratetype(page);
> >> +               __free_one_page(page, page_to_pfn(page), zone,
> >> +                               order, mt, false);
> >> +               spin_unlock(&zone->lock);
> >> +       }
> >> +}
> >> +
> >> +static void bm_set_pfn(struct page *page)
> >> +{
> >> +       unsigned long bitnr = 0;
> >> +       int zonenum = page_zonenum(page);
> >> +       struct zone *zone = page_zone(page);
> >> +
> >> +       lockdep_assert_held(&zone->lock);
> >> +       bitnr = pfn_to_bit(page, zonenum);
> >> +       if (bm_zone[zonenum].bitmap &&
> >> +           bitnr < bm_zone[zonenum].nbits &&
> >> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
> >> +               atomic_inc(&bm_zone[zonenum].free_pages);
> >> +}
> >> +
> >> +static void scan_hinting_bitmap(int zonenum, int free_pages)
> >> +{
> >> +       unsigned long set_bit, start = 0;
> >> +       struct page *page;
> >> +       struct zone *zone;
> >> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
> >> +       LIST_HEAD(isolated_pages);
> >> +
> >> +       ret = hcb->prepare();
> >> +       if (ret < 0)
> >> +               return;
> >> +       for (;;) {
> >> +               ret = 0;
> >> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
> >> +                                       bm_zone[zonenum].nbits, start);
> >> +               if (set_bit >= bm_zone[zonenum].nbits)
> >> +                       break;
> >> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> >> +                               bm_zone[zonenum].base_pfn);
> >> +               if (!page)
> >> +                       continue;
> >> +               zone = page_zone(page);
> >> +               spin_lock(&zone->lock);
> >> +
> >> +               if (PageBuddy(page) && page_private(page) >=
> >> +                   PAGE_HINTING_MIN_ORDER) {
> >> +                       order = page_private(page);
> >> +                       ret = __isolate_free_page(page, order);
> >> +               }
> >> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
> >> +               spin_unlock(&zone->lock);
> >> +               if (ret) {
> >> +                       /*
> >> +                        * restoring page order to use it while releasing
> >> +                        * the pages back to the buddy.
> >> +                        */
> >> +                       set_page_private(page, order);
> >> +                       list_add_tail(&page->lru, &isolated_pages);
> >> +                       isolated_cnt++;
> >> +                       if (isolated_cnt == hcb->max_pages) {
> >> +                               hcb->hint_pages(&isolated_pages);
> >> +                               release_buddy_pages(&isolated_pages);
> >> +                               isolated_cnt = 0;
> >> +                       }
> >> +               }
> >> +               start = set_bit + 1;
> >> +               scanned_pages++;
> >> +       }
> >> +       if (isolated_cnt) {
> >> +               hcb->hint_pages(&isolated_pages);
> >> +               release_buddy_pages(&isolated_pages);
> >> +       }
> >> +       hcb->cleanup();
> >> +       if (scanned_pages > free_pages)
> >> +               atomic_sub((scanned_pages - free_pages),
> >> +                          &bm_zone[zonenum].free_pages);
> >> +}
> >> +
> >> +static bool check_hinting_threshold(void)
> >> +{
> >> +       int zonenum = 0;
> >> +
> >> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> >> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
> >> +                               hcb->max_pages)
> >> +                       return true;
> >> +       }
> >> +       return false;
> >> +}
> >> +
> >> +static void init_hinting_wq(struct work_struct *work)
> >> +{
> >> +       int zonenum = 0, free_pages = 0;
> >> +
> >> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
> >> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
> >> +               if (free_pages >= hcb->max_pages) {
> >> +                       /* Find a better way to synchronize per zone
> >> +                        * free_pages.
> >> +                        */
> >> +                       atomic_sub(free_pages,
> >> +                                  &bm_zone[zonenum].free_pages);
> >> +                       scan_hinting_bitmap(zonenum, free_pages);
> >> +               }
> >> +       }
> >> +}
> >> +
> >> +void page_hinting_enqueue(struct page *page, int order)
> >> +{
> >> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
> >> +               bm_set_pfn(page);
> >> +       else
> >> +               return;
> > You could probably flip the logic and save yourself an "else" by just
> > doing something like:
> > if (!hcb || order < PAGE_HINTING_MIN_ORDER)
> >         return;
> >
> > I think it would also make this more readable.
> >
> +1
> >> +
> >> +       if (check_hinting_threshold()) {
> >> +               int cpu = smp_processor_id();
> >> +
> >> +               queue_work_on(cpu, system_wq, &hinting_work);
> >> +       }
> >> +}
> >> --
> >> 2.21.0
> >>
> --
> Regards
> Nitesh
>


* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-04 15:14       ` Alexander Duyck
@ 2019-06-04 16:07         ` Nitesh Narayan Lal
  2019-06-04 16:25           ` Alexander Duyck
  0 siblings, 1 reply; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 16:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli




On 6/4/19 11:14 AM, Alexander Duyck wrote:
> On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 6/3/19 3:04 PM, Alexander Duyck wrote:
>>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> This patch introduces the core infrastructure for free page hinting in
>>>> virtual environments. It enables the kernel to track the free pages which
>>>> can be reported to its hypervisor so that the hypervisor can
>>>> free and reuse that memory as needed.
>>>>
>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>> would be possible. To avoid such a situation, these pages are
>>>> temporarily removed from the buddy. The number of pages removed
>>>> temporarily from the buddy is governed by the backend (virtio-balloon
>>>> in our case).
>>>>
>>>> To efficiently identify free pages that can be hinted to the
>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>
>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>
>>>> There are still various things to look into (e.g., memory hotplug, more
>>>> efficient locking, possible races when disabling).
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> So one thing I had thought about, that I don't believe that has been
>>> addressed in your solution, is to determine a means to guarantee
>>> forward progress. If you have a noisy thread that is allocating and
>>> freeing some block of memory repeatedly you will be stuck processing
>>> that and cannot get to the other work. Specifically if you have a zone
>>> where somebody is just cycling the number of pages needed to fill your
>>> hinting queue how do you get around it and get to the data that is
>>> actually cold instead of getting stuck processing the noise?
>> It should not matter. Every time the memory threshold is met, the entire
>> bitmap is scanned, not just a chunk of memory, for possible isolation.
>> This guarantees forward progress.
> So I think there may still be some issues. I see how you go from the
> start to the end, but how do you loop back to the start again as pages
> are added? The init_hinting_wq doesn't seem to have a way to get back
> to the start again if there is still work to do after you have
> completed your pass without queue_work_on firing off another thread.
>
That will be taken care of as part of a new job, which will be enqueued as
soon as the free memory count for the respective zone reaches the threshold.
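
One way to make that loop-back explicit would be for the worker itself to
re-check the threshold before returning; this is only a sketch of the
behavior under discussion, not code from the series:

    static void init_hinting_wq(struct work_struct *work)
    {
            /* ... per-zone scan as in the patch ... */

            /* Sketch: if more pages crossed the threshold while we were
             * scanning, re-queue ourselves instead of relying solely on
             * the freeing path to re-arm the work item.
             */
            if (check_hinting_threshold())
                    queue_work_on(smp_processor_id(), system_wq,
                                  &hinting_work);
    }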

>>> Do you have any idea what the hit rate would be on a system that is on
>>> the more active side? From what I can tell you still are effectively
>>> just doing a linear search of memory, but you have the bitmap hints to
>>> tell what has not been freed recently; however, you still don't know
>>> that the pages you have bitmap hints for are actually free until you
>>> check them.
>>>
>>>> ---
>>>>  drivers/virtio/Kconfig       |   1 +
>>>>  include/linux/page_hinting.h |  46 +++++++
>>>>  mm/Kconfig                   |   6 +
>>>>  mm/Makefile                  |   2 +
>>>>  mm/page_alloc.c              |  17 +--
>>>>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>>>>  6 files changed, 301 insertions(+), 7 deletions(-)
>>>>  create mode 100644 include/linux/page_hinting.h
>>>>  create mode 100644 mm/page_hinting.c
>>>>
>>>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>>>> index 35897649c24f..5a96b7a2ed1e 100644
>>>> --- a/drivers/virtio/Kconfig
>>>> +++ b/drivers/virtio/Kconfig
>>>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>>>>         tristate "Virtio balloon driver"
>>>>         depends on VIRTIO
>>>>         select MEMORY_BALLOON
>>>> +       select PAGE_HINTING
>>>>         ---help---
>>>>          This driver supports increasing and decreasing the amount
>>>>          of memory within a KVM guest.
>>>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>>>> new file mode 100644
>>>> index 000000000000..e65188fe1e6b
>>>> --- /dev/null
>>>> +++ b/include/linux/page_hinting.h
>>>> @@ -0,0 +1,46 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef _LINUX_PAGE_HINTING_H
>>>> +#define _LINUX_PAGE_HINTING_H
>>>> +
>>>> +/*
>>>> + * Minimum page order required for a page to be hinted to the host.
>>>> + */
>>>> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
>>>> +
>>>> +/*
>>>> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
>>>> + * isolated pages.
>>>> + * @prepare:           Callback responsible for allocating an array to hold
>>>> + *                     the isolated pages.
>>>> + * @hint_pages:                Callback which reports the isolated pages synchronously
>>>> + *                     to the host.
>>>> + * @cleanup:           Callback to free the array used for reporting the
>>>> + *                     isolated pages.
>>>> + * @max_pages:         Maximum pages that are going to be hinted to the host
>>>> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
>>>> + */
>>>> +struct page_hinting_cb {
>>>> +       int (*prepare)(void);
>>>> +       void (*hint_pages)(struct list_head *list);
>>>> +       void (*cleanup)(void);
>>>> +       int max_pages;
>>>> +};
>>>> +
>>>> +#ifdef CONFIG_PAGE_HINTING
>>>> +void page_hinting_enqueue(struct page *page, int order);
>>>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>>>> +void page_hinting_disable(void);
>>>> +#else
>>>> +static inline void page_hinting_enqueue(struct page *page, int order)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void page_hinting_disable(void)
>>>> +{
>>>> +}
>>>> +#endif
>>>> +#endif /* _LINUX_PAGE_HINTING_H */
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index ee8d1f311858..177d858de758 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>>>>  config ARCH_HAS_PTE_SPECIAL
>>>>         bool
>>>>
>>>> +# PAGE_HINTING will allow the guest to report the free pages to the
>>>> +# host at regular intervals of time.
>>>> +config PAGE_HINTING
>>>> +       bool
>>>> +       def_bool n
>>>> +       depends on X86_64
>>>>  endmenu
>>>> diff --git a/mm/Makefile b/mm/Makefile
>>>> index ac5e5ba78874..bec456dfee34 100644
>>>> --- a/mm/Makefile
>>>> +++ b/mm/Makefile
>>>> @@ -41,6 +41,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>>>>                            interval_tree.o list_lru.o workingset.o \
>>>>                            debug.o $(mmu-y)
>>>>
>>>> +
>>>>  # Give 'page_alloc' its own module-parameter namespace
>>>>  page-alloc-y := page_alloc.o
>>>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>>>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>>>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>>>  obj-$(CONFIG_CMA)      += cma.o
>>>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>>>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 3b13d3914176..d12f69e0e402 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -68,6 +68,7 @@
>>>>  #include <linux/lockdep.h>
>>>>  #include <linux/nmi.h>
>>>>  #include <linux/psi.h>
>>>> +#include <linux/page_hinting.h>
>>>>
>>>>  #include <asm/sections.h>
>>>>  #include <asm/tlbflush.h>
>>>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>>>   * -- nyc
>>>>   */
>>>>
>>>> -static inline void __free_one_page(struct page *page,
>>>> +inline void __free_one_page(struct page *page,
>>>>                 unsigned long pfn,
>>>>                 struct zone *zone, unsigned int order,
>>>> -               int migratetype)
>>>> +               int migratetype, bool hint)
>>>>  {
>>>>         unsigned long combined_pfn;
>>>>         unsigned long uninitialized_var(buddy_pfn);
>>>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>>>>  done_merging:
>>>>         set_page_order(page, order);
>>>>
>>>> +       if (hint)
>>>> +               page_hinting_enqueue(page, order);
>>> This is a bit early to probably be dealing with the hint. You should
>>> probably look at moving this down to a spot somewhere after the page
>>> has been added to the free list. It may not cause any issues with the
>>> current order setup, but moving after the addition to the free list
>>> will make it so that you know it is in there when you call this
>>> function.
>> I will take a look at this.
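
That is, the tail of __free_one_page() would hint only after the page is
actually on the free list; a rough sketch, with the surrounding kernel code
elided:

    done_merging:
            set_page_order(page, order);
            /* ... existing buddy placement ... */
            list_add(&page->lru,
                     &zone->free_area[order].free_list[migratetype]);
            zone->free_area[order].nr_free++;

            /* The page is now visible on the free list, so the hinting
             * scan can safely find and isolate it.
             */
            if (hint)
                    page_hinting_enqueue(page, order);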
>>>>         /*
>>>>          * If this is not the largest possible page, check if the buddy
>>>>          * of the next-highest order is free. If it is, it's possible
>>>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>>>                 if (unlikely(isolated_pageblocks))
>>>>                         mt = get_pageblock_migratetype(page);
>>>>
>>>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>>>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>>>         }
>>>>         spin_unlock(&zone->lock);
>>>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>>>  static void free_one_page(struct zone *zone,
>>>>                                 struct page *page, unsigned long pfn,
>>>>                                 unsigned int order,
>>>> -                               int migratetype)
>>>> +                               int migratetype, bool hint)
>>>>  {
>>>>         spin_lock(&zone->lock);
>>>>         if (unlikely(has_isolate_pageblock(zone) ||
>>>>                 is_migrate_isolate(migratetype))) {
>>>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>>>         }
>>>> -       __free_one_page(page, pfn, zone, order, migratetype);
>>>> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>>>>         spin_unlock(&zone->lock);
>>>>  }
>>>>
>>>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>>>         local_irq_save(flags);
>>>>         __count_vm_events(PGFREE, 1 << order);
>>>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>>>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>>>         local_irq_restore(flags);
>>>>  }
>>>>
>>>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>>>          */
>>>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>>>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>>>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>>>                         return;
>>>>                 }
>>>>                 migratetype = MIGRATE_MOVABLE;
>>> So it looks like you are using a parameter to identify if the page is
>>> a hinted page or not. I guess this works but it seems like it is a bit
>>> intrusive as you are adding an argument to specify that this is a
>>> specific page type.
>> Any suggestions on how we could do this in a less intrusive manner?
> The quick approach would be to add some piece of metadata somewhere in
> the page that you could trigger off of. If you could do that then drop
> the need for all these extra checks and instead just not perform the
> notification on the pages. I really don't think the addition of the
> "Treated" flag was all that invasive, at least within the kernel. It
> would allow you to avoid all the changes to free_one_page, and
> __free_one_page.
>
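
A minimal sketch of that metadata-based approach (the flag name and its
accessors are hypothetical here, modeled on the "Treated" flag from the
alternative series):

    /* In release_buddy_pages(), mark the page before freeing it back: */
    __SetPageTreated(page);                 /* hypothetical flag helper */
    __free_one_page(page, page_to_pfn(page), zone, order, mt);

    /* In __free_one_page(), instead of threading a 'bool hint' argument
     * through every free path:
     */
    if (!TestClearPageTreated(page))        /* also hypothetical */
            page_hinting_enqueue(page, order);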
>>>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>>>> new file mode 100644
>>>> index 000000000000..7341c6462de2
>>>> --- /dev/null
>>>> +++ b/mm/page_hinting.c
>>>> @@ -0,0 +1,236 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Page hinting support to enable a VM to report the freed pages back
>>>> + * to the host.
>>>> + *
>>>> + * Copyright Red Hat, Inc. 2019
>>>> + *
>>>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>>>> + */
>>>> +
>>>> +#include <linux/mm.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/page_hinting.h>
>>>> +#include <linux/kvm_host.h>
>>>> +
>>>> +/*
>>>> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
>>>> + * and other required parameters which could help in retrieving the original
>>>> + * PFN value using the bitmap.
>>>> + * @bitmap:            Pointer to the bitmap of free PFN.
>>>> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
>>>> + * @free_pages:                Tracks the number of free pages of granularity
>>>> + *                     PAGE_HINTING_MIN_ORDER.
>>>> + * @nbits:             Indicates the total size of the bitmap in bits allocated
>>>> + *                     at the time of initialization.
>>>> + */
>>>> +struct hinting_bitmap {
>>>> +       unsigned long *bitmap;
>>>> +       unsigned long base_pfn;
>>>> +       atomic_t free_pages;
>>>> +       unsigned long nbits;
>>>> +} bm_zone[MAX_NR_ZONES];
>>>> +
>>> This ignores NUMA, doesn't it? Shouldn't you have support for other NUMA nodes?
>> I will have to look into this.
> So it doesn't cause a panic, but with 2 NUMA nodes you are only
> hinting on half the memory. I was able to build, test, and verify
> this. I had resolved it by simply multiplying MAX_NR_ZONES by
> MAX_NUMNODES, and splitting my indices between node and zone.
I see, thanks.
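
A sketch of that index split, using existing zone helpers (the array simply
grows to cover every node):

    struct hinting_bitmap bm_zone[MAX_NUMNODES * MAX_NR_ZONES];

    static inline int bm_index(struct zone *zone)
    {
            /* One slot per (node, zone) pair instead of per zone only. */
            return zone_to_nid(zone) * MAX_NR_ZONES + zone_idx(zone);
    }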
>>>> +static void init_hinting_wq(struct work_struct *work);
>>>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>>>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>>>> +                           struct zone *zone, unsigned int order,
>>>> +                           int migratetype, bool hint);
>>>> +const struct page_hinting_cb *hcb;
>>>> +struct work_struct hinting_work;
>>>> +
>>>> +static unsigned long find_bitmap_size(struct zone *zone)
>>>> +{
>>>> +       unsigned long nbits = ALIGN(zone->spanned_pages,
>>>> +                           PAGE_HINTING_MIN_ORDER);
>>>> +
>>>> +       nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>>>> +       return nbits;
>>>> +}
>>>> +
>>> This doesn't look right to me. You are trying to do something like a
>>> DIV_ROUND_UP here, right? If so shouldn't you be aligning to 1 <<
>>> PAGE_HINTING_MIN_ORDER, instead of just PAGE_HINTING_MIN_ORDER?
>>> Another option would be to just do DIV_ROUND_UP with the 1 <<
>>> PAGE_HINTING_MIN_ORDER value.
>> I will double check this.
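
For illustration, the DIV_ROUND_UP variant would be the following; on x86_64
with MAX_ORDER = 11, PAGE_HINTING_MIN_ORDER is 9, so each bit covers 512
pages, i.e. one 2 MB chunk:

    static unsigned long find_bitmap_size(struct zone *zone)
    {
            /* Divide by the chunk size in pages (1 << MIN_ORDER),
             * not by the order value itself.
             */
            return DIV_ROUND_UP(zone->spanned_pages,
                                1UL << PAGE_HINTING_MIN_ORDER);
    }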
>>>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>>>> +{
>>>> +       struct zone *zone;
>>>> +       int idx = 0;
>>>> +       unsigned long bitmap_size = 0;
>>>> +
>>>> +       for_each_populated_zone(zone) {
>>> The index for this doesn't match up to the index you used to define
>>> bm_zone. for_each_populated_zone will go through each zone in each
>>> pgdat. Right now you can only handle one pgdat.
>> Not sure if I understood this entirely. Can you please explain this further?
>>>> +               spin_lock(&zone->lock);
>>>> +               bitmap_size = find_bitmap_size(zone);
>>>> +               bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>>>> +               if (!bm_zone[idx].bitmap)
>>>> +                       return;
>>>> +               bm_zone[idx].nbits = bitmap_size;
>>>> +               bm_zone[idx].base_pfn = zone->zone_start_pfn;
>>>> +               spin_unlock(&zone->lock);
>>>> +               idx++;
>>>> +       }
>>>> +       hcb = callback;
>>>> +       INIT_WORK(&hinting_work, init_hinting_wq);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>>>> +
>>>> +void page_hinting_disable(void)
>>>> +{
>>>> +       struct zone *zone;
>>>> +       int idx = 0;
>>>> +
>>>> +       cancel_work_sync(&hinting_work);
>>>> +       hcb = NULL;
>>>> +       for_each_populated_zone(zone) {
>>>> +               spin_lock(&zone->lock);
>>>> +               bitmap_free(bm_zone[idx].bitmap);
>>>> +               bm_zone[idx].base_pfn = 0;
>>>> +               bm_zone[idx].nbits = 0;
>>>> +               atomic_set(&bm_zone[idx].free_pages, 0);
>>>> +               spin_unlock(&zone->lock);
>>>> +               idx++;
>>>> +       }
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>>>> +
>>>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>>>> +{
>>>> +       unsigned long bitnr;
>>>> +
>>>> +       bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>>>> +                        >> PAGE_HINTING_MIN_ORDER;
>>>> +       return bitnr;
>>>> +}
>>>> +
>>>> +static void release_buddy_pages(struct list_head *pages)
>>>> +{
>>>> +       int mt = 0, zonenum, order;
>>>> +       struct page *page, *next;
>>>> +       struct zone *zone;
>>>> +       unsigned long bitnr;
>>>> +
>>>> +       list_for_each_entry_safe(page, next, pages, lru) {
>>>> +               zonenum = page_zonenum(page);
>>>> +               zone = page_zone(page);
>>>> +               bitnr = pfn_to_bit(page, zonenum);
>>>> +               spin_lock(&zone->lock);
>>>> +               list_del(&page->lru);
>>>> +               order = page_private(page);
>>>> +               set_page_private(page, 0);
>>>> +               mt = get_pageblock_migratetype(page);
>>>> +               __free_one_page(page, page_to_pfn(page), zone,
>>>> +                               order, mt, false);
>>>> +               spin_unlock(&zone->lock);
>>>> +       }
>>>> +}
>>>> +
>>>> +static void bm_set_pfn(struct page *page)
>>>> +{
>>>> +       unsigned long bitnr = 0;
>>>> +       int zonenum = page_zonenum(page);
>>>> +       struct zone *zone = page_zone(page);
>>>> +
>>>> +       lockdep_assert_held(&zone->lock);
>>>> +       bitnr = pfn_to_bit(page, zonenum);
>>>> +       if (bm_zone[zonenum].bitmap &&
>>>> +           bitnr < bm_zone[zonenum].nbits &&
>>>> +           !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>>>> +               atomic_inc(&bm_zone[zonenum].free_pages);
>>>> +}
>>>> +
>>>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>>>> +{
>>>> +       unsigned long set_bit, start = 0;
>>>> +       struct page *page;
>>>> +       struct zone *zone;
>>>> +       int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>>>> +       LIST_HEAD(isolated_pages);
>>>> +
>>>> +       ret = hcb->prepare();
>>>> +       if (ret < 0)
>>>> +               return;
>>>> +       for (;;) {
>>>> +               ret = 0;
>>>> +               set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>>>> +                                       bm_zone[zonenum].nbits, start);
>>>> +               if (set_bit >= bm_zone[zonenum].nbits)
>>>> +                       break;
>>>> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>>>> +                               bm_zone[zonenum].base_pfn);
>>>> +               if (!page)
>>>> +                       continue;
>>>> +               zone = page_zone(page);
>>>> +               spin_lock(&zone->lock);
>>>> +
>>>> +               if (PageBuddy(page) && page_private(page) >=
>>>> +                   PAGE_HINTING_MIN_ORDER) {
>>>> +                       order = page_private(page);
>>>> +                       ret = __isolate_free_page(page, order);
>>>> +               }
>>>> +               clear_bit(set_bit, bm_zone[zonenum].bitmap);
>>>> +               spin_unlock(&zone->lock);
>>>> +               if (ret) {
>>>> +                       /*
>>>> +                        * restoring page order to use it while releasing
>>>> +                        * the pages back to the buddy.
>>>> +                        */
>>>> +                       set_page_private(page, order);
>>>> +                       list_add_tail(&page->lru, &isolated_pages);
>>>> +                       isolated_cnt++;
>>>> +                       if (isolated_cnt == hcb->max_pages) {
>>>> +                               hcb->hint_pages(&isolated_pages);
>>>> +                               release_buddy_pages(&isolated_pages);
>>>> +                               isolated_cnt = 0;
>>>> +                       }
>>>> +               }
>>>> +               start = set_bit + 1;
>>>> +               scanned_pages++;
>>>> +       }
>>>> +       if (isolated_cnt) {
>>>> +               hcb->hint_pages(&isolated_pages);
>>>> +               release_buddy_pages(&isolated_pages);
>>>> +       }
>>>> +       hcb->cleanup();
>>>> +       if (scanned_pages > free_pages)
>>>> +               atomic_sub((scanned_pages - free_pages),
>>>> +                          &bm_zone[zonenum].free_pages);
>>>> +}
>>>> +
>>>> +static bool check_hinting_threshold(void)
>>>> +{
>>>> +       int zonenum = 0;
>>>> +
>>>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>>>> +               if (atomic_read(&bm_zone[zonenum].free_pages) >=
>>>> +                               hcb->max_pages)
>>>> +                       return true;
>>>> +       }
>>>> +       return false;
>>>> +}
>>>> +
>>>> +static void init_hinting_wq(struct work_struct *work)
>>>> +{
>>>> +       int zonenum = 0, free_pages = 0;
>>>> +
>>>> +       for (; zonenum < MAX_NR_ZONES; zonenum++) {
>>>> +               free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>>>> +               if (free_pages >= hcb->max_pages) {
>>>> +                       /* Find a better way to synchronize per zone
>>>> +                        * free_pages.
>>>> +                        */
>>>> +                       atomic_sub(free_pages,
>>>> +                                  &bm_zone[zonenum].free_pages);
>>>> +                       scan_hinting_bitmap(zonenum, free_pages);
>>>> +               }
>>>> +       }
>>>> +}
>>>> +
>>>> +void page_hinting_enqueue(struct page *page, int order)
>>>> +{
>>>> +       if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>>>> +               bm_set_pfn(page);
>>>> +       else
>>>> +               return;
>>> You could probably flip the logic and save yourself an "else" by just
>>> doing something like:
>>> if (!hcb || order < PAGE_HINTING_MIN_ORDER)
>>>         return;
>>>
>>> I think it would also make this more readable.
>>>
>> +1
>>>> +
>>>> +       if (check_hinting_threshold()) {
>>>> +               int cpu = smp_processor_id();
>>>> +
>>>> +               queue_work_on(cpu, system_wq, &hinting_work);
>>>> +       }
>>>> +}
>>>> --
>>>> 2.21.0
>>>>
>> --
>> Regards
>> Nitesh
>>
-- 
Regards
Nitesh




* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-04 16:07         ` Nitesh Narayan Lal
@ 2019-06-04 16:25           ` Alexander Duyck
  2019-06-04 16:42             ` Nitesh Narayan Lal
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2019-06-04 16:25 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Jun 4, 2019 at 9:08 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/4/19 11:14 AM, Alexander Duyck wrote:
> > On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 6/3/19 3:04 PM, Alexander Duyck wrote:
> >>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>> This patch introduces the core infrastructure for free page hinting in
> >>>> virtual environments. It enables the kernel to track the free pages which
> >>>> can be reported to its hypervisor so that the hypervisor can
> >>>> free and reuse that memory as needed.
> >>>>
> >>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>> via MADV_FREE), the guest must not use them, otherwise, data loss
> >>>> would be possible. To avoid such a situation, these pages are
> >>>> temporarily removed from the buddy. The number of pages removed
> >>>> temporarily from the buddy is governed by the backend (virtio-balloon
> >>>> in our case).
> >>>>
> >>>> To efficiently identify free pages that can be hinted to the
> >>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>
> >>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>
> >>>> There are still various things to look into (e.g., memory hotplug, more
> >>>> efficient locking, possible races when disabling).
> >>>>
> >>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >>> So one thing I had thought about, that I don't believe that has been
> >>> addressed in your solution, is to determine a means to guarantee
> >>> forward progress. If you have a noisy thread that is allocating and
> >>> freeing some block of memory repeatedly you will be stuck processing
> >>> that and cannot get to the other work. Specifically if you have a zone
> >>> where somebody is just cycling the number of pages needed to fill your
> >>> hinting queue how do you get around it and get to the data that is
> >>> actually cold instead of getting stuck processing the noise?
> >> It should not matter. Every time the memory threshold is met, the entire
> >> bitmap is scanned, not just a chunk of memory, for possible isolation.
> >> This guarantees forward progress.
> > So I think there may still be some issues. I see how you go from the
> > start to the end, but how do you loop back to the start again as pages
> > are added? The init_hinting_wq doesn't seem to have a way to get back
> > to the start again if there is still work to do after you have
> > completed your pass without queue_work_on firing off another thread.
> >
> That will be taken care of as part of a new job, which will be enqueued as
> soon as the free memory count for the respective zone reaches the threshold.

So does that mean that you have multiple threads all calling
queue_work_on until you get below the threshold? If so it seems like
that would get expensive since that is an atomic test and set
operation that would be hammered until you get below that threshold.


* Re: [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-06-03 17:03 ` [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
  2019-06-03 22:38   ` Alexander Duyck
@ 2019-06-04 16:33   ` Alexander Duyck
  2019-06-04 16:44     ` Nitesh Narayan Lal
  1 sibling, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2019-06-04 16:33 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> Enables the kernel to negotiate VIRTIO_BALLOON_F_HINTING feature with the
> host. If it is available and page_hinting_flag is set to true, page_hinting
> is enabled and its callbacks are configured along with the max_pages count
> which indicates the maximum number of pages that can be isolated and hinted
> at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are
> reported. To prevent any false OOMs, the max_pages count is set to 16.
>
> By default, the page_hinting feature is enabled and gets loaded as soon
> as the virtio-balloon driver is loaded. However, it can be disabled by
> writing to page_hinting_flag, which is a virtio-balloon module parameter.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 112 +++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_balloon.h |  14 ++++
>  2 files changed, 125 insertions(+), 1 deletion(-)

<snip>

> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index a1966cd7b677..25e4f817c660 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -29,6 +29,7 @@
>  #include <linux/virtio_types.h>
>  #include <linux/virtio_ids.h>
>  #include <linux/virtio_config.h>
> +#include <linux/page_hinting.h>

So this include breaks the build and from what I can tell it isn't
really needed. I deleted it in order to be able to build without
warnings about the file not being included in UAPI.

>  /* The feature bitmap for virtio balloon */
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST        0 /* Tell before reclaiming pages */
> @@ -36,6 +37,7 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT        3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
>
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -108,4 +110,16 @@ struct virtio_balloon_stat {
>         __virtio64 val;
>  } __attribute__((packed));
>
> +#ifdef CONFIG_PAGE_HINTING
> +/*
> + * struct hinting_data - holds the information associated with hinting.
> + * @phys_addr: physical address associated with a page or the array holding
> + *             the isolated pages.
> + * @size:      total size associated with the phys_addr.
> + */
> +struct hinting_data {
> +       __virtio64 phys_addr;
> +       __virtio32 size;
> +};
> +#endif
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> --
> 2.21.0
>


* Re: [QEMU PATCH] KVM: Support for page hinting
  2019-06-03 17:04 ` [QEMU PATCH] KVM: Support for page hinting Nitesh Narayan Lal
  2019-06-03 18:34   ` Alexander Duyck
@ 2019-06-04 16:41   ` Alexander Duyck
  2019-06-04 16:48     ` Nitesh Narayan Lal
  1 sibling, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2019-06-04 16:41 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> Enables QEMU to call madvise on the pages which are reported
> by the guest kernel.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  hw/virtio/trace-events                        |  1 +
>  hw/virtio/virtio-balloon.c                    | 85 +++++++++++++++++++
>  include/hw/virtio/virtio-balloon.h            |  2 +-
>  include/qemu/osdep.h                          |  7 ++
>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>  5 files changed, 95 insertions(+), 1 deletion(-)

<snip>

> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 840af09cb0..4d632933a9 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #else
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>  #endif
> +#ifdef MADV_FREE
> +#define QEMU_MADV_FREE MADV_FREE
> +#else
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> +#endif

Is there a specific reason for making this default to INVALID instead
of just using DONTNEED? I ran into some issues as my host kernel
didn't have support for MADV_FREE in the exported kernel headers
apparently so I was getting no effect. It seems like it would be
better to fall back to doing DONTNEED instead of just disabling the
functionality altogether.
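
A sketch of that fallback (QEMU_MADV_DONTNEED is defined earlier in osdep.h):

    #ifdef MADV_FREE
    #define QEMU_MADV_FREE MADV_FREE
    #else
    /* Fall back to DONTNEED rather than disabling hinting entirely. */
    #define QEMU_MADV_FREE QEMU_MADV_DONTNEED
    #endif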

>  #elif defined(CONFIG_POSIX_MADVISE)
>
> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID

Same here. If you already have MADV_DONTNEED you could just use that
instead of disabling the functionality.

>  #else /* no-op */
>
> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>
>  #endif
>


* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-04 16:25           ` Alexander Duyck
@ 2019-06-04 16:42             ` Nitesh Narayan Lal
  2019-06-04 17:12               ` Alexander Duyck
  0 siblings, 1 reply; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 16:42 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli




On 6/4/19 12:25 PM, Alexander Duyck wrote:
> On Tue, Jun 4, 2019 at 9:08 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 6/4/19 11:14 AM, Alexander Duyck wrote:
>>> On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> On 6/3/19 3:04 PM, Alexander Duyck wrote:
>>>>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>>>> This patch introduces the core infrastructure for free page hinting in
>>>>>> virtual environments. It enables the kernel to track the free pages which
>>>>>> can be reported to its hypervisor so that the hypervisor can
>>>>>> free and reuse that memory as needed.
>>>>>>
>>>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>>>> would be possible. To avoid such a situation, these pages are
>>>>>> temporarily removed from the buddy. The number of pages removed
>>>>>> temporarily from the buddy is governed by the backend (virtio-balloon
>>>>>> in our case).
>>>>>>
>>>>>> To efficiently identify free pages that can be hinted to the
>>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>>>
>>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>>>
>>>>>> There are still various things to look into (e.g., memory hotplug, more
>>>>>> efficient locking, possible races when disabling).
>>>>>>
>>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>>> So one thing I had thought about, that I don't believe that has been
>>>>> addressed in your solution, is to determine a means to guarantee
>>>>> forward progress. If you have a noisy thread that is allocating and
>>>>> freeing some block of memory repeatedly you will be stuck processing
>>>>> that and cannot get to the other work. Specifically if you have a zone
>>>>> where somebody is just cycling the number of pages needed to fill your
>>>>> hinting queue how do you get around it and get to the data that is
>>>>> actually cold instead of getting stuck processing the noise?
>>>> It should not matter. Every time the memory threshold is met, the entire
>>>> bitmap is scanned, not just a chunk of memory, for possible isolation.
>>>> This guarantees forward progress.
>>> So I think there may still be some issues. I see how you go from the
>>> start to the end, but how do you loop back to the start again as pages
>>> are added? The init_hinting_wq doesn't seem to have a way to get back
>>> to the start again if there is still work to do after you have
>>> completed your pass without queue_work_on firing off another thread.
>>>
>> That will be taken care of as part of a new job, which will be enqueued as
>> soon as the free memory count for the respective zone reaches the threshold.
> So does that mean that you have multiple threads all calling
> queue_work_on until you get below the threshold?
Every time a page of order MAX_ORDER - 2 is added to the buddy, the free
memory count is incremented if the bit is not already set, and its value
is checked against the threshold.
>  If so it seems like
> that would get expensive since that is an atomic test and set
> operation that would be hammered until you get below that threshold.

Not sure if I understood "until you get below that threshold".
Can you please explain?
test_and_set_bit() will be called every time a page of order MAX_ORDER - 2
is added to the buddy (if not already hinted).


-- 
Regards
Nitesh




* Re: [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-06-04 16:33   ` Alexander Duyck
@ 2019-06-04 16:44     ` Nitesh Narayan Lal
  0 siblings, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 16:44 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli




On 6/4/19 12:33 PM, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> Enables the kernel to negotiate VIRTIO_BALLOON_F_HINTING feature with the
>> host. If it is available and page_hinting_flag is set to true, page_hinting
>> is enabled and its callbacks are configured along with the max_pages count
>> which indicates the maximum number of pages that can be isolated and hinted
>> at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are
>> reported. To prevent any false OOMs, the max_pages count is set to 16.
>>
>> By default, the page_hinting feature is enabled and gets loaded as soon
>> as the virtio-balloon driver is loaded. However, it can be disabled by
>> writing to page_hinting_flag, which is a virtio-balloon module parameter.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/virtio_balloon.c     | 112 +++++++++++++++++++++++++++-
>>  include/uapi/linux/virtio_balloon.h |  14 ++++
>>  2 files changed, 125 insertions(+), 1 deletion(-)
> <snip>
>
>> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
>> index a1966cd7b677..25e4f817c660 100644
>> --- a/include/uapi/linux/virtio_balloon.h
>> +++ b/include/uapi/linux/virtio_balloon.h
>> @@ -29,6 +29,7 @@
>>  #include <linux/virtio_types.h>
>>  #include <linux/virtio_ids.h>
>>  #include <linux/virtio_config.h>
>> +#include <linux/page_hinting.h>
> So this include breaks the build and from what I can tell it isn't
> really needed. I deleted it in order to be able to build without
> warnings about the file not being included in UAPI.
I agree here, it is not required any more.
>
>>  /* The feature bitmap for virtio balloon */
>>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST        0 /* Tell before reclaiming pages */
>> @@ -36,6 +37,7 @@
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT        3 /* VQ to report free pages */
>>  #define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
>> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
>>
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>> @@ -108,4 +110,16 @@ struct virtio_balloon_stat {
>>         __virtio64 val;
>>  } __attribute__((packed));
>>
>> +#ifdef CONFIG_PAGE_HINTING
>> +/*
>> + * struct hinting_data - holds the information associated with hinting.
>> + * @phys_addr: physical address associated with a page or the array holding
>> + *             the isolated pages.
>> + * @size:      total size associated with the phys_addr.
>> + */
>> +struct hinting_data {
>> +       __virtio64 phys_addr;
>> +       __virtio32 size;
>> +};
>> +#endif
>>  #endif /* _LINUX_VIRTIO_BALLOON_H */
>> --
>> 2.21.0
>>
-- 
Regards
Nitesh




* Re: [QEMU PATCH] KVM: Support for page hinting
  2019-06-04 16:41   ` Alexander Duyck
@ 2019-06-04 16:48     ` Nitesh Narayan Lal
  0 siblings, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-04 16:48 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli




On 6/4/19 12:41 PM, Alexander Duyck wrote:
> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> Enables QEMU to call madvise on the pages which are reported
>> by the guest kernel.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  hw/virtio/trace-events                        |  1 +
>>  hw/virtio/virtio-balloon.c                    | 85 +++++++++++++++++++
>>  include/hw/virtio/virtio-balloon.h            |  2 +-
>>  include/qemu/osdep.h                          |  7 ++
>>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>>  5 files changed, 95 insertions(+), 1 deletion(-)
> <snip>
>
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index 840af09cb0..4d632933a9 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #else
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>>  #endif
>> +#ifdef MADV_FREE
>> +#define QEMU_MADV_FREE MADV_FREE
>> +#else
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>> +#endif
> Is there a specific reason for making this default to INVALID instead
> of just using DONTNEED?
No specific reason.
>  I ran into some issues as my host kernel
> didn't have support for MADV_FREE in the exported kernel headers
> apparently so I was getting no effect. It seems like it would be
> better to fall back to doing DONTNEED instead of just disabling the
> functionality altogether.
Possibly; I will look into it further.
>>  #elif defined(CONFIG_POSIX_MADVISE)
>>
>> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> Same here. If you already have MADV_DONTNEED you could just use that
> instead of disabling the functionality.
>
>>  #else /* no-op */
>>
>> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>
>>  #endif
>>
-- 
Regards
Nitesh




* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-04 16:42             ` Nitesh Narayan Lal
@ 2019-06-04 17:12               ` Alexander Duyck
  0 siblings, 0 replies; 33+ messages in thread
From: Alexander Duyck @ 2019-06-04 17:12 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Jun 4, 2019 at 9:42 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/4/19 12:25 PM, Alexander Duyck wrote:
> > On Tue, Jun 4, 2019 at 9:08 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 6/4/19 11:14 AM, Alexander Duyck wrote:
> >>> On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>> On 6/3/19 3:04 PM, Alexander Duyck wrote:
> >>>>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>>>> This patch introduces the core infrastructure for free page hinting in
> >>>>>> virtual environments. It enables the kernel to track the free pages which
> >>>>>> can be reported to its hypervisor so that the hypervisor can
> >>>>>> free and reuse that memory as needed.
> >>>>>>
> >>>>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
> >>>>>> would be possible. To avoid such a situation, these pages are
> >>>>>> temporarily removed from the buddy. The number of pages removed
> >>>>>> temporarily from the buddy is governed by the backend (virtio-balloon
> >>>>>> in our case).
> >>>>>>
> >>>>>> To efficiently identify free pages that can be hinted to the
> >>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>>>
> >>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>>>
> >>>>>> There are still various things to look into (e.g., memory hotplug, more
> >>>>>> efficient locking, possible races when disabling).
> >>>>>>
> >>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >>>>> So one thing I had thought about, that I don't believe that has been
> >>>>> addressed in your solution, is to determine a means to guarantee
> >>>>> forward progress. If you have a noisy thread that is allocating and
> >>>>> freeing some block of memory repeatedly you will be stuck processing
> >>>>> that and cannot get to the other work. Specifically if you have a zone
> >>>>> where somebody is just cycling the number of pages needed to fill your
> >>>>> hinting queue how do you get around it and get to the data that is
> >>>>> actually cold instead of getting stuck processing the noise?
> >>>> It should not matter. Every time the memory threshold is met, the entire
> >>>> bitmap is scanned, not just a chunk of memory, for possible isolation.
> >>>> This guarantees forward progress.
> >>> So I think there may still be some issues. I see how you go from the
> >>> start to the end, but how do you loop back to the start again as pages
> >>> are added? The init_hinting_wq doesn't seem to have a way to get back
> >>> to the start again if there is still work to do after you have
> >>> completed your pass without queue_work_on firing off another thread.
> >>>
> >> That will be taken care of as part of a new job, which will be enqueued as
> >> soon as the free memory count for the respective zone reaches the threshold.
> > So does that mean that you have multiple threads all calling
> > queue_work_on until you get below the threshold?
> Every time a page of order MAX_ORDER - 2 is added to the buddy, the free
> memory count is incremented if the bit is not already set, and its value
> is checked against the threshold.
> >  If so it seems like
> > that would get expensive since that is an atomic test and set
> > operation that would be hammered until you get below that threshold.
>
> Not sure if I understood "until you get below that threshold".
> Can you please explain?
> test_and_set_bit() will be called every time a page of order MAX_ORDER - 2
> is added to the buddy (if not already hinted).

I had overlooked the other paths that are already making use of the
test_and_set_bit(). What I was getting at specifically is that the
WORK_PENDING bit in the work struct is going to be getting hit every
time you add a new page. So it is adding yet another atomic operation
in addition to the increment and test_and_set_bit() that you were
already doing.

Generally you may want to look at trying to reduce how often you are
having to perform these atomic operations. So for example one thing
you could do is use something like an atomic_read before you do your
atomic_inc to determine if you are transitioning to a state where you
were below, and now you are above the threshold. Doing something like
that could save you on the number of calls you are making and save
some significant CPU cycles.
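
As a sketch of that transition check (illustrative only, not from the
series), the queueing could be tied to the single increment that crosses
the threshold:

    /* In bm_set_pfn(): queue work only on the below-to-above transition,
     * so the work struct's atomic PENDING bit is not hammered on every
     * subsequent free.
     */
    if (!test_and_set_bit(bitnr, bm_zone[zonenum].bitmap) &&
        atomic_inc_return(&bm_zone[zonenum].free_pages) == hcb->max_pages)
            queue_work_on(smp_processor_id(), system_wq, &hinting_work);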


* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
  2019-06-03 18:04 ` [RFC][Patch v10 0/2] mm: " Michael S. Tsirkin
  2019-06-03 18:38   ` Nitesh Narayan Lal
@ 2019-06-11 12:19   ` Nitesh Narayan Lal
  2019-06-11 15:00     ` Alexander Duyck
  2019-06-25 14:48   ` Nitesh Narayan Lal
  2 siblings, 1 reply; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-11 12:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck




On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?
Sorry for the late reply, but I haven't been feeling too well during the
last week.

The main differences are that this series uses a bitmap to track pages
that should be hinted to the hypervisor, while Alexander's series tracks
them directly in core-mm. Also, in order to prevent duplicate hints,
Alexander's series uses a newly defined page flag, whereas I have added
another argument to __free_one_page.
For these reasons, Alexander's series is relatively more core-mm
invasive, while this series is lightweight (e.g., in lines of code).
We'll have to see if there are real performance differences.

I'm planning on doing some further investigations/review/testing/...
once I'm back on track.
>
>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests could be launched each of 5 GB(total 
>> 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
>> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
>> in the last invocation to only need 37s compared to 3m35s without page hinting.
>>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor processed
>> the hinting request, the isolated pages are returned back to the buddy.
>>
>> The key changes made in this series compared to v9[1] are:
>> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to
>> not break up the THP.
>> * At a time only a set of 16 pages can be isolated and reported to the host to
>> avoids any false OOMs.
>> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
>> on virtio and not on KVM itself. This would enable any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device assigned guests to ensure that hinting doesn't break it.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, possible races when
>> disabling.
>> * Come up with proper/traceable error-message/logs.
>> * Minor reworks and simplifications (e.g., virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via memset
>> and exits.
>> Procedure =
>> The first guest is launched and once its console is up, the test allocation
>> program is executed with 4 GB memory request (Due to this the guest occupies
>> almost 4-5 GB of memory in the host in a system without page hinting). Once
>> this program exits at that time another guest is launched in the host and the
>> same process is followed. It is continued until the swap is not used.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>> one after the other in each of them.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
>>
>> Performance analysis:
>> 1. will-it-scale's page_faul1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413
>>
-- 
Regards
Nitesh




* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
  2019-06-11 12:19   ` Nitesh Narayan Lal
@ 2019-06-11 15:00     ` Alexander Duyck
  0 siblings, 0 replies; 33+ messages in thread
From: Alexander Duyck @ 2019-06-11 15:00 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	David Hildenbrand, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Jun 11, 2019 at 5:19 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> >> This patch series proposes an efficient mechanism for communicating free memory
> >> from a guest to its hypervisor. It especially enables guests with no page cache
> >> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> >> rapidly hand back free memory to the hypervisor.
> >> This approach has a minimal impact on the existing core-mm infrastructure.
> > Could you help us compare with Alex's series?
> > What are the main differences?
> Sorry for the late reply, but I haven't been feeling too well during the
> last week.
>
> The main differences are that this series uses a bitmap to track pages
> that should be hinted to the hypervisor, while Alexander's series tracks
> them directly in core-mm. Also, in order to prevent duplicate hints,
> Alexander's series uses a newly defined page flag, whereas I have added
> another argument to __free_one_page.
> For these reasons, Alexander's series is relatively more core-mm
> invasive, while this series is more lightweight (e.g., in lines of code).
> We'll have to see if there are real performance differences.
>
> I'm planning on doing some further investigations/review/testing/...
> once I'm back on track.

BTW one thing I found is that I will likely need to add a new
parameter to __free_one_page like you did, as I need to defer setting
the flag until after all of the merges have happened. Otherwise we may
set the flag on a given page, and then after the merge that page may
not be the one we ultimately add to the free list.
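
To make that concrete, here is a tiny user-space model of the merge flow (the
names are hypothetical and this is only a sketch, not code from either
series): the block that survives merging is not necessarily the page that
entered the free path, so any "report me" marking must happen at the end.

/*
 * Toy model: free_one_page() first merges with free buddies, as the buddy
 * allocator does, and only then flags the surviving block for hinting.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 11

/* Pretend the buddy happens to be free at the two lowest orders. */
static bool buddy_is_free(unsigned long pfn, unsigned int order)
{
    return order < 2;
}

static void free_one_page(unsigned long pfn, unsigned int order)
{
    /* Merge first... */
    while (order < MAX_ORDER - 1 && buddy_is_free(pfn, order)) {
        pfn &= ~(1UL << order);   /* head pfn of the combined block */
        order++;
    }

    /*
     * ...and only now mark the surviving block as a hinting candidate.
     * Flagging the original page up front could tag a pfn that is no
     * longer the head of any block on the free list.
     */
    printf("flag block pfn=%lu order=%u for hinting\n", pfn, order);
}

int main(void)
{
    free_one_page(6, 0);   /* prints pfn=4 order=2, not pfn=6 order=0 */
    return 0;
}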

I'll try to have an update with all of my changes ready before the end
of this week.

Thanks.

- Alex


* Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
  2019-06-03 17:03 ` [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
  2019-06-03 19:04   ` Alexander Duyck
  2019-06-03 19:57   ` David Hildenbrand
@ 2019-06-14  7:24   ` David Hildenbrand
  2 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2019-06-14  7:24 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, mst,
	dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck

On 03.06.19 19:03, Nitesh Narayan Lal wrote:
> This patch introduces the core infrastructure for free page hinting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
> 
> While the pages are being processed in the hypervisor (e.g.,
> via MADV_FREE), the guest must not use them; otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The number of pages temporarily
> removed from the buddy is governed by the backend (virtio-balloon
> in our case).
> 
> To efficiently identify free pages that can be hinted to the
> hypervisor, coarse-granularity bitmaps are used. Only fairly big
> chunks - "MAX_ORDER - 2" on x86 - are reported to the hypervisor,
> especially so as not to break up THP in the hypervisor, and to save
> space. The bits in the bitmap are an indication that a page *might*
> be free, not a guarantee. A new hook after buddy merging sets the bits.
> 
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned to the buddy.
> 
> There are still various things to look into (e.g., memory hotplug, more
> efficient locking, possible races when disabling).
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/Kconfig       |   1 +
>  include/linux/page_hinting.h |  46 +++++++
>  mm/Kconfig                   |   6 +
>  mm/Makefile                  |   2 +
>  mm/page_alloc.c              |  17 +--
>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>  6 files changed, 301 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..5a96b7a2ed1e 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
>  	select MEMORY_BALLOON
> +	select PAGE_HINTING
>  	---help---
>  	 This driver supports increasing and decreasing the amount
>  	 of memory within a KVM guest.

BTW, this hunk belongs to the virtio-balloon patch.
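
Separately, the "might be free, not a guarantee" semantics in the commit
message above can be modeled with a short user-space sketch (hypothetical
names, not the patch's code): the asynchronous scanner has to re-check each
candidate block before isolating and reporting it.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 8

static bool block_free[NBLOCKS];  /* stands in for the buddy free lists */
static bool hint_bit[NBLOCKS];    /* per-block "might be free" bitmap */

static void mark_freed(int i)
{
    block_free[i] = true;
    hint_bit[i] = true;           /* the post-merge hook sets the bit */
}

static void allocate(int i)
{
    block_free[i] = false;        /* note: the bit is NOT cleared here */
}

static void scan_and_report(void) /* the workqueue side */
{
    int i;

    for (i = 0; i < NBLOCKS; i++) {
        if (!hint_bit[i])
            continue;
        hint_bit[i] = false;
        if (!block_free[i])       /* stale bit: block was reallocated */
            continue;
        printf("isolate and report block %d\n", i);
    }
}

int main(void)
{
    mark_freed(2);
    mark_freed(5);
    allocate(5);                  /* block 5's bit is now stale */
    scan_and_report();            /* only block 2 is reported */
    return 0;
}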


-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
  2019-06-03 18:04 ` [RFC][Patch v10 0/2] mm: " Michael S. Tsirkin
  2019-06-03 18:38   ` Nitesh Narayan Lal
  2019-06-11 12:19   ` Nitesh Narayan Lal
@ 2019-06-25 14:48   ` Nitesh Narayan Lal
  2019-06-25 17:10     ` Alexander Duyck
  2 siblings, 1 reply; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-25 14:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck




On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?
Results comparing the benefits/performance of Alexander's v1
(bubble-hinting)[1] and Page-Hinting (which includes some of the
upstream-suggested changes on v10) against an unmodified kernel.

Test1 - Number of guests that can be launched without swap usage.
Guest size: 5GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Process: Guests are launched sequentially; each new guest is launched after
an allocation program with a 4GB request has run in the previous one.

Results:
unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
usage of 2.3GB.
bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
usage of 1MB.
Page-hinting: 5 guests without swap usage and 6th guest with a swap
usage of 8MB.
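
For reference, the allocation program itself is not included in the thread; a
minimal stand-in consistent with how it is described (allocate 4GB, touch it
via memset, exit) could look like this:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t size = 4UL << 30;  /* 4 GB */
    char *buf = malloc(size);

    if (!buf)
        return 1;
    memset(buf, 1, size);     /* touch every page so it is really backed */
    free(buf);                /* freed memory becomes a hinting candidate */
    return 0;
}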


Test2 - Memhog execution time
Guest size: 6GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Process: 3 guests are launched and "time memhog 6G" is launched in each
of them sequentially.

Results:
unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
the end-3.6G)
bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
end-0)
Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)


Test3 - Will-it-scale's page_fault1
Guest size: 6GB
Cores: 24
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)

unmodified kernel:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,459168,95.83,459315,95.83,459315
2,956272,91.68,884643,91.72,918630
3,1407811,87.53,1267948,87.69,1377945
4,1755744,83.39,1562471,83.73,1837260
5,2056741,79.24,1812309,80.00,2296575
6,2393759,75.09,2025719,77.02,2755890
7,2754403,70.95,2238180,73.72,3215205
8,2947493,66.81,2369686,70.37,3674520
9,3063579,62.68,2321148,68.84,4133835
10,3229023,58.54,2377596,65.84,4593150
11,3337665,54.40,2429818,64.01,5052465
12,3255140,50.28,2395070,61.63,5511780
13,3260721,46.11,2402644,59.77,5971095
14,3210590,42.02,2390806,57.46,6430410
15,3164811,37.88,2265352,51.39,6889725
16,3144764,33.77,2335028,54.07,7349040
17,3128839,29.63,2328662,49.52,7808355
18,3133344,25.50,2301181,48.01,8267670
19,3135979,21.38,2343003,43.66,8726985
20,3136448,17.27,2306109,40.81,9186300
21,3130324,13.16,2403688,35.84,9645615
22,3109883,9.04,2290808,36.24,10104930
23,3136805,4.94,2263818,35.43,10564245
24,3118949,0.78,2252891,31.03,11023560

bubble-hinting v1:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,292183,95.83,292428,95.83,292428
2,540606,91.67,501887,91.91,584856
3,821748,87.53,735244,88.31,877284
4,1033782,83.38,839925,85.59,1169712
5,1261352,79.25,896464,83.86,1462140
6,1459544,75.12,1050094,80.93,1754568
7,1686537,70.97,1112202,79.23,2046996
8,1866892,66.83,1083571,78.48,2339424
9,2056887,62.72,1101660,77.94,2631852
10,2252955,58.57,1097439,77.36,2924280
11,2413907,54.40,1088583,76.72,3216708
12,2596504,50.35,1117474,76.01,3509136
13,2715338,46.21,1087666,75.32,3801564
14,2861697,42.08,1084692,74.35,4093992
15,2964620,38.02,1087910,73.40,4386420
16,3065575,33.84,1099406,71.07,4678848
17,3107674,29.76,1056948,71.36,4971276
18,3144963,25.71,1094883,70.14,5263704
19,3173468,21.61,1073049,66.21,5556132
20,3173233,17.55,1072417,67.16,5848560
21,3209710,13.37,1079147,65.64,6140988
22,3182958,9.37,1085872,65.95,6433416
23,3200747,5.23,1076414,59.40,6725844
24,3181699,1.04,1051233,65.62,7018272

Page-hinting:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,467693,95.83,467970,95.83,467970
2,967860,91.68,895883,91.70,935940
3,1408191,87.53,1279602,87.68,1403910
4,1766250,83.39,1557224,83.93,1871880
5,2124689,79.24,1834625,80.35,2339850
6,2413514,75.10,1989557,77.00,2807820
7,2644648,70.95,2158055,73.73,3275790
8,2896483,66.81,2305785,70.85,3743760
9,3157796,62.67,2304083,69.49,4211730
10,3251633,58.53,2379589,66.43,4679700
11,3313704,54.41,2349310,64.76,5147670
12,3285612,50.30,2362013,62.63,5615640
13,3207275,46.17,2377760,59.94,6083610
14,3221727,42.02,2416278,56.70,6551580
15,3194781,37.91,2334552,54.96,7019550
16,3211818,33.78,2399077,52.75,7487520
17,3172664,29.65,2337660,50.27,7955490
18,3177152,25.49,2349721,47.02,8423460
19,3149924,21.36,2319286,40.16,8891430
20,3166910,17.30,2279719,43.23,9359400
21,3159464,13.19,2342849,34.84,9827370
22,3167091,9.06,2285156,37.97,10295340
23,3174137,4.96,2365448,33.74,10763310
24,3161629,0.86,2253813,32.38,11231280


Test4: Netperf
Guest size: 5GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Netserver: Running on core 0
Netperf: Running on core 1
Recv Socket Size bytes: 131072
Send Socket Size bytes: 16384
Send Message Size bytes: 1000000000
Time: 900s
Process: netperf is run 3 times sequentially in the same guest with the
same inputs mentioned above, and throughput (10^6 bits/sec) is observed.
unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07

Drawback with bubble-hinting:
More invasive.

Drawback with page-hinting:
Additional bitmap required, including growing/shrinking the bitmap on
memory hotplug.


[1] https://lkml.org/lkml/2019/6/19/926
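
To illustrate the hotplug drawback noted above, here is a small user-space
sketch (hypothetical structure and helper names, not code from the series) of
the bookkeeping a per-zone bitmap needs when the zone's spanned range grows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HINT_ORDER 9  /* stands in for MAX_ORDER - 2 */

struct zone_bitmap {
    unsigned long *bits;
    unsigned long nlongs;
};

/* Grow (or shrink) the bitmap to cover 'spanned_pages' worth of blocks. */
static int zone_bitmap_resize(struct zone_bitmap *zb,
                              unsigned long spanned_pages)
{
    unsigned long nbits = spanned_pages >> HINT_ORDER;
    unsigned long nlongs = (nbits + 63) / 64;
    unsigned long *bits = realloc(zb->bits, nlongs * sizeof(*bits));

    if (!bits && nlongs)
        return -1;
    if (nlongs > zb->nlongs)  /* new range starts as "not known free" */
        memset(bits + zb->nlongs, 0,
               (nlongs - zb->nlongs) * sizeof(*bits));
    zb->bits = bits;
    zb->nlongs = nlongs;
    return 0;
}

int main(void)
{
    struct zone_bitmap zb = { NULL, 0 };

    zone_bitmap_resize(&zb, 4UL << 20);  /* zone spans 16 GB of 4K pages */
    zone_bitmap_resize(&zb, 5UL << 20);  /* hotplug grows it by 4 GB */
    printf("bitmap now uses %lu longs\n", zb.nlongs);
    free(zb.bits);
    return 0;
}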
>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests of 5 GB each could be launched
>> (total 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
>> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
>> in the last invocation needing only 37s, compared to 3m35s without page hinting.
>>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor has
>> processed the hinting request, the isolated pages are returned to the buddy.
>>
>> The key changes made in this series compared to v9[1] are:
>> * Only pages in chunks of "MAX_ORDER - 2" are reported to the hypervisor, so as
>> not to break up THPs.
>> * Only a set of 16 pages can be isolated and reported to the host at a time, to
>> avoid any false OOMs.
>> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
>> on virtio and not on KVM itself. This would enable any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device-assigned guests to ensure that hinting doesn't break them.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, possible races when
>> disabling.
>> * Come up with proper/traceable error-message/logs.
>> * Minor reworks and simplifications (e.g., virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via memset
>> and exits.
>> Procedure =
>> The first guest is launched and, once its console is up, the test allocation
>> program is executed with a 4 GB memory request (because of this, the guest
>> occupies almost 4-5 GB of memory in the host on a system without page
>> hinting). Once this program exits, another guest is launched in the host and
>> the same process is followed. This is repeated until swap comes into use.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 guests are launched and the ‘memhog 6G’ execution time is monitored
>> in each of them, one after the other.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
>>
>> Performance analysis:
>> 1. will-it-scale's page_fault1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413
>>
-- 
Regards
Nitesh




* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
  2019-06-25 14:48   ` Nitesh Narayan Lal
@ 2019-06-25 17:10     ` Alexander Duyck
       [not found]       ` <cc20a6d2-9e95-3de4-301a-f2a6a5b025e4@redhat.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2019-06-25 17:10 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	David Hildenbrand, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Jun 25, 2019 at 7:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> >> This patch series proposes an efficient mechanism for communicating free memory
> >> from a guest to its hypervisor. It especially enables guests with no page cache
> >> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> >> rapidly hand back free memory to the hypervisor.
> >> This approach has a minimal impact on the existing core-mm infrastructure.
> > Could you help us compare with Alex's series?
> > What are the main differences?
> Results comparing the benefits/performance of Alexander's v1
> (bubble-hinting)[1] and Page-Hinting (which includes some of the
> upstream-suggested changes on v10) against an unmodified kernel.
>
> Test1 - Number of guests that can be launched without swap usage.
> Guest size: 5GB
> Cores: 4
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> Process: Guests are launched sequentially; each new guest is launched after
> an allocation program with a 4GB request has run in the previous one.
>
> Results:
> unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
> usage of 2.3GB.
> bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
> usage of 1MB.
> Page-hinting: 5 guests without swap usage and 6th guest with a swap
> usage of 8MB.
>
>
> Test2 - Memhog execution time
> Guest size: 6GB
> Cores: 4
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> Process: 3 guests are launched and "time memhog 6G" is launched in each
> of them sequentially.
>
> Results:
> unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
> the end-3.6G)
> bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
> end-0)
> Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)
>
>
> Test3 - Will-it-scale's page_fault1
> Guest size: 6GB
> Cores: 24
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>
> unmodified kernel:
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,459168,95.83,459315,95.83,459315
> 2,956272,91.68,884643,91.72,918630
> 3,1407811,87.53,1267948,87.69,1377945
> 4,1755744,83.39,1562471,83.73,1837260
> 5,2056741,79.24,1812309,80.00,2296575
> 6,2393759,75.09,2025719,77.02,2755890
> 7,2754403,70.95,2238180,73.72,3215205
> 8,2947493,66.81,2369686,70.37,3674520
> 9,3063579,62.68,2321148,68.84,4133835
> 10,3229023,58.54,2377596,65.84,4593150
> 11,3337665,54.40,2429818,64.01,5052465
> 12,3255140,50.28,2395070,61.63,5511780
> 13,3260721,46.11,2402644,59.77,5971095
> 14,3210590,42.02,2390806,57.46,6430410
> 15,3164811,37.88,2265352,51.39,6889725
> 16,3144764,33.77,2335028,54.07,7349040
> 17,3128839,29.63,2328662,49.52,7808355
> 18,3133344,25.50,2301181,48.01,8267670
> 19,3135979,21.38,2343003,43.66,8726985
> 20,3136448,17.27,2306109,40.81,9186300
> 21,3130324,13.16,2403688,35.84,9645615
> 22,3109883,9.04,2290808,36.24,10104930
> 23,3136805,4.94,2263818,35.43,10564245
> 24,3118949,0.78,2252891,31.03,11023560
>
> bubble-hinting v1:
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,292183,95.83,292428,95.83,292428
> 2,540606,91.67,501887,91.91,584856
> 3,821748,87.53,735244,88.31,877284
> 4,1033782,83.38,839925,85.59,1169712
> 5,1261352,79.25,896464,83.86,1462140
> 6,1459544,75.12,1050094,80.93,1754568
> 7,1686537,70.97,1112202,79.23,2046996
> 8,1866892,66.83,1083571,78.48,2339424
> 9,2056887,62.72,1101660,77.94,2631852
> 10,2252955,58.57,1097439,77.36,2924280
> 11,2413907,54.40,1088583,76.72,3216708
> 12,2596504,50.35,1117474,76.01,3509136
> 13,2715338,46.21,1087666,75.32,3801564
> 14,2861697,42.08,1084692,74.35,4093992
> 15,2964620,38.02,1087910,73.40,4386420
> 16,3065575,33.84,1099406,71.07,4678848
> 17,3107674,29.76,1056948,71.36,4971276
> 18,3144963,25.71,1094883,70.14,5263704
> 19,3173468,21.61,1073049,66.21,5556132
> 20,3173233,17.55,1072417,67.16,5848560
> 21,3209710,13.37,1079147,65.64,6140988
> 22,3182958,9.37,1085872,65.95,6433416
> 23,3200747,5.23,1076414,59.40,6725844
> 24,3181699,1.04,1051233,65.62,7018272
>
> Page-hinting:
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,467693,95.83,467970,95.83,467970
> 2,967860,91.68,895883,91.70,935940
> 3,1408191,87.53,1279602,87.68,1403910
> 4,1766250,83.39,1557224,83.93,1871880
> 5,2124689,79.24,1834625,80.35,2339850
> 6,2413514,75.10,1989557,77.00,2807820
> 7,2644648,70.95,2158055,73.73,3275790
> 8,2896483,66.81,2305785,70.85,3743760
> 9,3157796,62.67,2304083,69.49,4211730
> 10,3251633,58.53,2379589,66.43,4679700
> 11,3313704,54.41,2349310,64.76,5147670
> 12,3285612,50.30,2362013,62.63,5615640
> 13,3207275,46.17,2377760,59.94,6083610
> 14,3221727,42.02,2416278,56.70,6551580
> 15,3194781,37.91,2334552,54.96,7019550
> 16,3211818,33.78,2399077,52.75,7487520
> 17,3172664,29.65,2337660,50.27,7955490
> 18,3177152,25.49,2349721,47.02,8423460
> 19,3149924,21.36,2319286,40.16,8891430
> 20,3166910,17.30,2279719,43.23,9359400
> 21,3159464,13.19,2342849,34.84,9827370
> 22,3167091,9.06,2285156,37.97,10295340
> 23,3174137,4.96,2365448,33.74,10763310
> 24,3161629,0.86,2253813,32.38,11231280
>
>
> Test4: Netperf
> Guest size: 5GB
> Cores: 4
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> Netserver: Running on core 0
> Netperf: Running on core 1
> Recv Socket Size bytes: 131072
> Send Socket Size bytes:16384
> Send Message Size bytes:1000000000
> Time: 900s
> Process: netperf is run 3 times sequentially in the same guest with the
> same inputs mentioned above and throughput (10^6bits/sec) is observed.
> unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
> bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
> Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07
>
> Drawback with bubble-hinting:
> More invasive.
>
> Drawback with page-hinting:
> Additional bitmap required, including growing/shrinking the bitmap on
> memory hotplug.
>
>
> [1] https://lkml.org/lkml/2019/6/19/926

Any chance you could provide a .config for your kernel? I'm wondering
what is different between the two, as it seems like you are showing a
significant regression in terms of performance for the bubble
hinting/aeration approach versus a stock kernel without the patches,
and that doesn't match up with what I have been seeing.

Also, any ETA for when we can look at the patches for the approach you have?

Thanks.

- Alex


* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
       [not found]       ` <cc20a6d2-9e95-3de4-301a-f2a6a5b025e4@redhat.com>
@ 2019-06-28 18:25         ` Alexander Duyck
  2019-06-28 19:13           ` Nitesh Narayan Lal
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2019-06-28 18:25 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	David Hildenbrand, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Jun 25, 2019 at 10:32 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> On 6/25/19 1:10 PM, Alexander Duyck wrote:
> > On Tue, Jun 25, 2019 at 7:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> >>> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> >>>> This patch series proposes an efficient mechanism for communicating free memory
> >>>> from a guest to its hypervisor. It especially enables guests with no page cache
> >>>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> >>>> rapidly hand back free memory to the hypervisor.
> >>>> This approach has a minimal impact on the existing core-mm infrastructure.
> >>> Could you help us compare with Alex's series?
> >>> What are the main differences?
> >> Results comparing the benefits/performance of Alexander's v1
> >> (bubble-hinting)[1] and Page-Hinting (which includes some of the
> >> upstream-suggested changes on v10) against an unmodified kernel.
> >>
> >> Test1 - Number of guests that can be launched without swap usage.
> >> Guest size: 5GB
> >> Cores: 4
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >> Process: Guests are launched sequentially; each new guest is launched after
> >> an allocation program with a 4GB request has run in the previous one.
> >>
> >> Results:
> >> unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
> >> usage of 2.3GB.
> >> bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
> >> usage of 1MB.
> >> Page-hinting: 5 guests without swap usage and 6th guest with a swap
> >> usage of 8MB.
> >>
> >>
> >> Test2 - Memhog execution time
> >> Guest size: 6GB
> >> Cores: 4
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >> Process: 3 guests are launched and "time memhog 6G" is launched in each
> >> of them sequentially.
> >>
> >> Results:
> >> unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
> >> the end-3.6G)
> >> bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
> >> end-0)
> >> Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)
> >>
> >>
> >> Test3 - Will-it-scale's page_fault1
> >> Guest size: 6GB
> >> Cores: 24
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >>
> >> unmodified kernel:
> >> tasks,processes,processes_idle,threads,threads_idle,linear
> >> 0,0,100,0,100,0
> >> 1,459168,95.83,459315,95.83,459315
> >> 2,956272,91.68,884643,91.72,918630
> >> 3,1407811,87.53,1267948,87.69,1377945
> >> 4,1755744,83.39,1562471,83.73,1837260
> >> 5,2056741,79.24,1812309,80.00,2296575
> >> 6,2393759,75.09,2025719,77.02,2755890
> >> 7,2754403,70.95,2238180,73.72,3215205
> >> 8,2947493,66.81,2369686,70.37,3674520
> >> 9,3063579,62.68,2321148,68.84,4133835
> >> 10,3229023,58.54,2377596,65.84,4593150
> >> 11,3337665,54.40,2429818,64.01,5052465
> >> 12,3255140,50.28,2395070,61.63,5511780
> >> 13,3260721,46.11,2402644,59.77,5971095
> >> 14,3210590,42.02,2390806,57.46,6430410
> >> 15,3164811,37.88,2265352,51.39,6889725
> >> 16,3144764,33.77,2335028,54.07,7349040
> >> 17,3128839,29.63,2328662,49.52,7808355
> >> 18,3133344,25.50,2301181,48.01,8267670
> >> 19,3135979,21.38,2343003,43.66,8726985
> >> 20,3136448,17.27,2306109,40.81,9186300
> >> 21,3130324,13.16,2403688,35.84,9645615
> >> 22,3109883,9.04,2290808,36.24,10104930
> >> 23,3136805,4.94,2263818,35.43,10564245
> >> 24,3118949,0.78,2252891,31.03,11023560
> >>
> >> bubble-hinting v1:
> >> tasks,processes,processes_idle,threads,threads_idle,linear
> >> 0,0,100,0,100,0
> >> 1,292183,95.83,292428,95.83,292428
> >> 2,540606,91.67,501887,91.91,584856
> >> 3,821748,87.53,735244,88.31,877284
> >> 4,1033782,83.38,839925,85.59,1169712
> >> 5,1261352,79.25,896464,83.86,1462140
> >> 6,1459544,75.12,1050094,80.93,1754568
> >> 7,1686537,70.97,1112202,79.23,2046996
> >> 8,1866892,66.83,1083571,78.48,2339424
> >> 9,2056887,62.72,1101660,77.94,2631852
> >> 10,2252955,58.57,1097439,77.36,2924280
> >> 11,2413907,54.40,1088583,76.72,3216708
> >> 12,2596504,50.35,1117474,76.01,3509136
> >> 13,2715338,46.21,1087666,75.32,3801564
> >> 14,2861697,42.08,1084692,74.35,4093992
> >> 15,2964620,38.02,1087910,73.40,4386420
> >> 16,3065575,33.84,1099406,71.07,4678848
> >> 17,3107674,29.76,1056948,71.36,4971276
> >> 18,3144963,25.71,1094883,70.14,5263704
> >> 19,3173468,21.61,1073049,66.21,5556132
> >> 20,3173233,17.55,1072417,67.16,5848560
> >> 21,3209710,13.37,1079147,65.64,6140988
> >> 22,3182958,9.37,1085872,65.95,6433416
> >> 23,3200747,5.23,1076414,59.40,6725844
> >> 24,3181699,1.04,1051233,65.62,7018272
> >>
> >> Page-hinting:
> >> tasks,processes,processes_idle,threads,threads_idle,linear
> >> 0,0,100,0,100,0
> >> 1,467693,95.83,467970,95.83,467970
> >> 2,967860,91.68,895883,91.70,935940
> >> 3,1408191,87.53,1279602,87.68,1403910
> >> 4,1766250,83.39,1557224,83.93,1871880
> >> 5,2124689,79.24,1834625,80.35,2339850
> >> 6,2413514,75.10,1989557,77.00,2807820
> >> 7,2644648,70.95,2158055,73.73,3275790
> >> 8,2896483,66.81,2305785,70.85,3743760
> >> 9,3157796,62.67,2304083,69.49,4211730
> >> 10,3251633,58.53,2379589,66.43,4679700
> >> 11,3313704,54.41,2349310,64.76,5147670
> >> 12,3285612,50.30,2362013,62.63,5615640
> >> 13,3207275,46.17,2377760,59.94,6083610
> >> 14,3221727,42.02,2416278,56.70,6551580
> >> 15,3194781,37.91,2334552,54.96,7019550
> >> 16,3211818,33.78,2399077,52.75,7487520
> >> 17,3172664,29.65,2337660,50.27,7955490
> >> 18,3177152,25.49,2349721,47.02,8423460
> >> 19,3149924,21.36,2319286,40.16,8891430
> >> 20,3166910,17.30,2279719,43.23,9359400
> >> 21,3159464,13.19,2342849,34.84,9827370
> >> 22,3167091,9.06,2285156,37.97,10295340
> >> 23,3174137,4.96,2365448,33.74,10763310
> >> 24,3161629,0.86,2253813,32.38,11231280
> >>
> >>
> >> Test4: Netperf
> >> Guest size: 5GB
> >> Cores: 4
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >> Netserver: Running on core 0
> >> Netperf: Running on core 1
> >> Recv Socket Size bytes: 131072
> >> Send Socket Size bytes:16384
> >> Send Message Size bytes:1000000000
> >> Time: 900s
> >> Process: netperf is run 3 times sequentially in the same guest with the
> >> same inputs mentioned above and throughput (10^6bits/sec) is observed.
> >> unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
> >> bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
> >> Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07
> >>
> >> Drawback with bubble-hinting:
> >> More invasive.
> >>
> >> Drawback with page-hinting:
> >> Additional bitmap required, including growing/shrinking the bitmap on
> >> memory hotplug.
> >>
> >>
> >> [1] https://lkml.org/lkml/2019/6/19/926
> > Any chance you could provide a .config for your kernel? I'm wondering
> > what is different between the two as it seems like you are showing a
> > significant regression in terms of performance for the bubble
> > hinting/aeration approach versus a stock kernel without the patches
> > and that doesn't match up with what I have been seeing.
> I have attached the config which I was using.

Were all of these runs with the same config? I ask because I noticed
the config you provided had a number of quite expensive memory debug
options enabled:

#
# Memory Debugging
#
CONFIG_PAGE_EXTENSION=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_PAGE_OWNER=y
# CONFIG_PAGE_POISONING is not set
CONFIG_DEBUG_PAGE_REF=y
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_DEBUG_OBJECTS=y
# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
# CONFIG_DEBUG_OBJECTS_FREE is not set
# CONFIG_DEBUG_OBJECTS_TIMERS is not set
# CONFIG_DEBUG_OBJECTS_WORK is not set
# CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set
# CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER is not set
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_SLUB_DEBUG_ON=y
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=400
# CONFIG_DEBUG_KMEMLEAK_TEST is not set
# CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_PER_CPU_MAPS=y
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_CC_HAS_KASAN_GENERIC=y
# CONFIG_KASAN is not set
CONFIG_KASAN_STACK=1
# end of Memory Debugging

When I went through and enabled these, my results for the bubble
hinting matched pretty closely with what you reported. However, when I
compiled without the patches but with this config enabled, the results
were still about what was reported with the bubble hinting, only maybe
5% improved. I'm just wondering if you were doing some additional
debugging and left those options enabled for the bubble hinting test
run.


* Re: [RFC][Patch v10 0/2] mm: Support for page hinting
  2019-06-28 18:25         ` Alexander Duyck
@ 2019-06-28 19:13           ` Nitesh Narayan Lal
  0 siblings, 0 replies; 33+ messages in thread
From: Nitesh Narayan Lal @ 2019-06-28 19:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	David Hildenbrand, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli




On 6/28/19 2:25 PM, Alexander Duyck wrote:
> On Tue, Jun 25, 2019 at 10:32 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> On 6/25/19 1:10 PM, Alexander Duyck wrote:
>>> On Tue, Jun 25, 2019 at 7:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>>>>>> This patch series proposes an efficient mechanism for communicating free memory
>>>>>> from a guest to its hypervisor. It especially enables guests with no page cache
>>>>>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>>>>>> rapidly hand back free memory to the hypervisor.
>>>>>> This approach has a minimal impact on the existing core-mm infrastructure.
>>>>> Could you help us compare with Alex's series?
>>>>> What are the main differences?
>>>> Results comparing the benefits/performance of Alexander's v1
>>>> (bubble-hinting)[1] and Page-Hinting (which includes some of the
>>>> upstream-suggested changes on v10) against an unmodified kernel.
>>>>
>>>> Test1 - Number of guests that can be launched without swap usage.
>>>> Guest size: 5GB
>>>> Cores: 4
>>>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>>>> Process: Guests are launched sequentially; each new guest is launched after
>>>> an allocation program with a 4GB request has run in the previous one.
>>>>
>>>> Results:
>>>> unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
>>>> usage of 2.3GB.
>>>> bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
>>>> usage of 1MB.
>>>> Page-hinting: 5 guests without swap usage and 6th guest with a swap
>>>> usage of 8MB.
>>>>
>>>>
>>>> Test2 - Memhog execution time
>>>> Guest size: 6GB
>>>> Cores: 4
>>>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>>>> Process: 3 guests are launched and "time memhog 6G" is launched in each
>>>> of them sequentially.
>>>>
>>>> Results:
>>>> unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
>>>> the end-3.6G)
>>>> bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
>>>> end-0)
>>>> Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)
>>>>
>>>>
>>>> Test3 - Will-it-scale's page_fault1
>>>> Guest size: 6GB
>>>> Cores: 24
>>>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>>>>
>>>> unmodified kernel:
>>>> tasks,processes,processes_idle,threads,threads_idle,linear
>>>> 0,0,100,0,100,0
>>>> 1,459168,95.83,459315,95.83,459315
>>>> 2,956272,91.68,884643,91.72,918630
>>>> 3,1407811,87.53,1267948,87.69,1377945
>>>> 4,1755744,83.39,1562471,83.73,1837260
>>>> 5,2056741,79.24,1812309,80.00,2296575
>>>> 6,2393759,75.09,2025719,77.02,2755890
>>>> 7,2754403,70.95,2238180,73.72,3215205
>>>> 8,2947493,66.81,2369686,70.37,3674520
>>>> 9,3063579,62.68,2321148,68.84,4133835
>>>> 10,3229023,58.54,2377596,65.84,4593150
>>>> 11,3337665,54.40,2429818,64.01,5052465
>>>> 12,3255140,50.28,2395070,61.63,5511780
>>>> 13,3260721,46.11,2402644,59.77,5971095
>>>> 14,3210590,42.02,2390806,57.46,6430410
>>>> 15,3164811,37.88,2265352,51.39,6889725
>>>> 16,3144764,33.77,2335028,54.07,7349040
>>>> 17,3128839,29.63,2328662,49.52,7808355
>>>> 18,3133344,25.50,2301181,48.01,8267670
>>>> 19,3135979,21.38,2343003,43.66,8726985
>>>> 20,3136448,17.27,2306109,40.81,9186300
>>>> 21,3130324,13.16,2403688,35.84,9645615
>>>> 22,3109883,9.04,2290808,36.24,10104930
>>>> 23,3136805,4.94,2263818,35.43,10564245
>>>> 24,3118949,0.78,2252891,31.03,11023560
>>>>
>>>> bubble-hinting v1:
>>>> tasks,processes,processes_idle,threads,threads_idle,linear
>>>> 0,0,100,0,100,0
>>>> 1,292183,95.83,292428,95.83,292428
>>>> 2,540606,91.67,501887,91.91,584856
>>>> 3,821748,87.53,735244,88.31,877284
>>>> 4,1033782,83.38,839925,85.59,1169712
>>>> 5,1261352,79.25,896464,83.86,1462140
>>>> 6,1459544,75.12,1050094,80.93,1754568
>>>> 7,1686537,70.97,1112202,79.23,2046996
>>>> 8,1866892,66.83,1083571,78.48,2339424
>>>> 9,2056887,62.72,1101660,77.94,2631852
>>>> 10,2252955,58.57,1097439,77.36,2924280
>>>> 11,2413907,54.40,1088583,76.72,3216708
>>>> 12,2596504,50.35,1117474,76.01,3509136
>>>> 13,2715338,46.21,1087666,75.32,3801564
>>>> 14,2861697,42.08,1084692,74.35,4093992
>>>> 15,2964620,38.02,1087910,73.40,4386420
>>>> 16,3065575,33.84,1099406,71.07,4678848
>>>> 17,3107674,29.76,1056948,71.36,4971276
>>>> 18,3144963,25.71,1094883,70.14,5263704
>>>> 19,3173468,21.61,1073049,66.21,5556132
>>>> 20,3173233,17.55,1072417,67.16,5848560
>>>> 21,3209710,13.37,1079147,65.64,6140988
>>>> 22,3182958,9.37,1085872,65.95,6433416
>>>> 23,3200747,5.23,1076414,59.40,6725844
>>>> 24,3181699,1.04,1051233,65.62,7018272
>>>>
>>>> Page-hinting:
>>>> tasks,processes,processes_idle,threads,threads_idle,linear
>>>> 0,0,100,0,100,0
>>>> 1,467693,95.83,467970,95.83,467970
>>>> 2,967860,91.68,895883,91.70,935940
>>>> 3,1408191,87.53,1279602,87.68,1403910
>>>> 4,1766250,83.39,1557224,83.93,1871880
>>>> 5,2124689,79.24,1834625,80.35,2339850
>>>> 6,2413514,75.10,1989557,77.00,2807820
>>>> 7,2644648,70.95,2158055,73.73,3275790
>>>> 8,2896483,66.81,2305785,70.85,3743760
>>>> 9,3157796,62.67,2304083,69.49,4211730
>>>> 10,3251633,58.53,2379589,66.43,4679700
>>>> 11,3313704,54.41,2349310,64.76,5147670
>>>> 12,3285612,50.30,2362013,62.63,5615640
>>>> 13,3207275,46.17,2377760,59.94,6083610
>>>> 14,3221727,42.02,2416278,56.70,6551580
>>>> 15,3194781,37.91,2334552,54.96,7019550
>>>> 16,3211818,33.78,2399077,52.75,7487520
>>>> 17,3172664,29.65,2337660,50.27,7955490
>>>> 18,3177152,25.49,2349721,47.02,8423460
>>>> 19,3149924,21.36,2319286,40.16,8891430
>>>> 20,3166910,17.30,2279719,43.23,9359400
>>>> 21,3159464,13.19,2342849,34.84,9827370
>>>> 22,3167091,9.06,2285156,37.97,10295340
>>>> 23,3174137,4.96,2365448,33.74,10763310
>>>> 24,3161629,0.86,2253813,32.38,11231280
>>>>
>>>>
>>>> Test4: Netperf
>>>> Guest size: 5GB
>>>> Cores: 4
>>>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>>>> Netserver: Running on core 0
>>>> Netperf: Running on core 1
>>>> Recv Socket Size bytes: 131072
>>>> Send Socket Size bytes:16384
>>>> Send Message Size bytes:1000000000
>>>> Time: 900s
>>>> Process: netperf is run 3 times sequentially in the same guest with the
>>>> same inputs mentioned above and throughput (10^6bits/sec) is observed.
>>>> unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
>>>> bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
>>>> Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07
>>>>
>>>> Drawback with bubble-hinting:
>>>> More invasive.
>>>>
>>>> Drawback with page-hinting:
>>>> Additional bitmap required, including growing/shrinking the bitmap on
>>>> memory hotplug.
>>>>
>>>>
>>>> [1] https://lkml.org/lkml/2019/6/19/926
>>> Any chance you could provide a .config for your kernel? I'm wondering
>>> what is different between the two as it seems like you are showing a
>>> significant regression in terms of performance for the bubble
>>> hinting/aeration approach versus a stock kernel without the patches
>>> and that doesn't match up with what I have been seeing.
>> I have attached the config which I was using.
> Were all of these runs with the same config? I ask because I noticed
> the config you provided had a number of quite expensive memory debug
> options enabled:
Yes, memory debugging configs were enabled for all the cases.
>
> #
> # Memory Debugging
> #
> CONFIG_PAGE_EXTENSION=y
> CONFIG_DEBUG_PAGEALLOC=y
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
> CONFIG_PAGE_OWNER=y
> # CONFIG_PAGE_POISONING is not set
> CONFIG_DEBUG_PAGE_REF=y
> # CONFIG_DEBUG_RODATA_TEST is not set
> CONFIG_DEBUG_OBJECTS=y
> # CONFIG_DEBUG_OBJECTS_SELFTEST is not set
> # CONFIG_DEBUG_OBJECTS_FREE is not set
> # CONFIG_DEBUG_OBJECTS_TIMERS is not set
> # CONFIG_DEBUG_OBJECTS_WORK is not set
> # CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set
> # CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER is not set
> CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
> CONFIG_SLUB_DEBUG_ON=y
> # CONFIG_SLUB_STATS is not set
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=400
> # CONFIG_DEBUG_KMEMLEAK_TEST is not set
> # CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
> CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
> CONFIG_DEBUG_STACK_USAGE=y
> CONFIG_DEBUG_VM=y
> # CONFIG_DEBUG_VM_VMACACHE is not set
> # CONFIG_DEBUG_VM_RB is not set
> # CONFIG_DEBUG_VM_PGFLAGS is not set
> CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_DEBUG_PER_CPU_MAPS=y
> CONFIG_HAVE_ARCH_KASAN=y
> CONFIG_CC_HAS_KASAN_GENERIC=y
> # CONFIG_KASAN is not set
> CONFIG_KASAN_STACK=1
> # end of Memory Debugging
>
> When I went through and enabled these, my results for the bubble
> hinting matched pretty closely with what you reported. However, when I
> compiled without the patches but with this config enabled, the results
> were still about what was reported with the bubble hinting, only maybe
> 5% improved. I'm just wondering if you were doing some additional
> debugging and left those options enabled for the bubble hinting test
> run.
I have the same set of debugging options enabled for all three cases
reported.
-- 
Thanks
Nitesh




end of thread

Thread overview: 33+ messages
2019-06-03 17:03 [RFC][Patch v10 0/2] mm: Support for page hinting Nitesh Narayan Lal
2019-06-03 17:03 ` [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
2019-06-03 19:04   ` Alexander Duyck
2019-06-04 12:55     ` Nitesh Narayan Lal
2019-06-04 15:14       ` Alexander Duyck
2019-06-04 16:07         ` Nitesh Narayan Lal
2019-06-04 16:25           ` Alexander Duyck
2019-06-04 16:42             ` Nitesh Narayan Lal
2019-06-04 17:12               ` Alexander Duyck
2019-06-03 19:57   ` David Hildenbrand
2019-06-04 13:16     ` Nitesh Narayan Lal
2019-06-14  7:24   ` David Hildenbrand
2019-06-03 17:03 ` [RFC][Patch v10 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
2019-06-03 22:38   ` Alexander Duyck
2019-06-04  7:12     ` David Hildenbrand
2019-06-04 11:50       ` Nitesh Narayan Lal
2019-06-04 11:31     ` Nitesh Narayan Lal
2019-06-04 16:33   ` Alexander Duyck
2019-06-04 16:44     ` Nitesh Narayan Lal
2019-06-03 17:04 ` [QEMU PATCH] KVM: Support for page hinting Nitesh Narayan Lal
2019-06-03 18:34   ` Alexander Duyck
2019-06-03 18:37     ` Nitesh Narayan Lal
2019-06-03 18:45     ` Nitesh Narayan Lal
2019-06-04 16:41   ` Alexander Duyck
2019-06-04 16:48     ` Nitesh Narayan Lal
2019-06-03 18:04 ` [RFC][Patch v10 0/2] mm: " Michael S. Tsirkin
2019-06-03 18:38   ` Nitesh Narayan Lal
2019-06-11 12:19   ` Nitesh Narayan Lal
2019-06-11 15:00     ` Alexander Duyck
2019-06-25 14:48   ` Nitesh Narayan Lal
2019-06-25 17:10     ` Alexander Duyck
     [not found]       ` <cc20a6d2-9e95-3de4-301a-f2a6a5b025e4@redhat.com>
2019-06-28 18:25         ` Alexander Duyck
2019-06-28 19:13           ` Nitesh Narayan Lal
