* [RFC][PATCH v11 0/2] mm: Support for page hinting
@ 2019-07-10 19:51 Nitesh Narayan Lal
  2019-07-10 19:51 ` [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
                   ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-10 19:51 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko

This patch series proposes an efficient mechanism for reporting free memory
from a guest to its hypervisor. It especially enables guests with no page cache
(e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
rapidly hand back free memory to the hypervisor.
This approach has a minimal impact on the existing core-mm infrastructure.

Measurement results (measurement details appended to this email):
*Number of 5GB guests (each touching 4GB memory) that can be launched
without swap usage on a system with 15GB:
unmodified kernel - 2, 3rd with 2.5GB   
v11 page hinting - 6, 7th with 26MB    
v1 bubble hinting[1] - 6, 7th with 1.8GB (bubble hinting is another series
proposed to solve the same problems)

*Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           


This approach tracks all freed pages of order MAX_ORDER - 2 or higher in
bitmaps. A new hook after buddy merging is used to set the corresponding bits
in the bitmap. Currently, the bits are only cleared when pages are hinted,
not when pages are re-allocated.

Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
threshold is met, trying to isolate and report pages that are still free.

The isolated pages are reported via virtio-balloon, which is responsible for
sending batched pages to the host synchronously. Once the hypervisor has
processed the hinting request, the isolated pages are returned to the buddy.
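
Roughly, the per-zone bookkeeping works as follows (a simplified sketch of
pfn_to_bit()/bm_set_pfn() from patch 1); each bit covers one chunk of order
PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2):

	/* bit index of a page within its zone's hinting bitmap */
	bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn)
			>> PAGE_HINTING_MIN_ORDER;
	if (!test_and_set_bit(bitnr, free_area[zone_idx].bitmap))
		atomic_inc(&free_area[zone_idx].free_pages);

With 4KB base pages this is one bit per 2MB chunk, so e.g. a 5GB zone needs
only 2560 bits (320 bytes) of bitmap.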

Changelog in v11:
* Added logic to handle multiple-NUMA-node scenarios.
* Simplified the logic for reporting isolated pages to the host (e.g., replaced
dynamically allocated arrays with static ones, and introduced a wait event
instead of a polling loop to wait for a response from the host).
* Added a mutex to prevent a race condition when page hinting is enabled by
multiple drivers.
* Simplified the logic responsible for decrementing free page counter for each
zone.
* Simplified code structuring/naming.

Known work items for the future:
* Test device-assigned guests to ensure that hinting doesn't break them.
* Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
* Decide between MADV_DONTNEED and MADV_FREE.
* Look into memory hotplug, more efficient locking, better naming conventions to
avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
* Come up with proper/traceable error messages/logs and look into other code
simplifications (if necessary).

Benefit analysis:
1. Number of 5GB guests (each touching 4GB memory) that can be launched without
swap usage on a system with 15GB:
unmodified kernel - 2, 3rd with 2.5GB   
v11 page hinting - 6, 7th with 26MB    
v1 bubble hinting - 6, 7th with 1.8GB   

Conclusion - In this particular test case, with either v11 page hinting or
v1 bubble hinting, 4 more guests could be launched without swapping compared
to an unmodified kernel.
For the 7th guest launch, v11 page hinting does slightly better than v1 bubble
hinting as it uses less swap space.

Setup & procedure - 
Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
Guest Memory = 5GB
Number of CPUs in the guest = 1
Host swap = 4GB
Workload = test allocation program that allocates 4GB of memory, touches it
via memset, and exits (a minimal sketch of it appears below).
The first guest is launched and, once its console is up, the test allocation
program is executed with a 4GB memory request (due to this, the guest occupies
almost 4-5GB of memory on a host without page hinting). Once this program
exits, another guest is launched on the host and the same process is followed.
This continues until the host starts using swap.
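
For reference, a minimal sketch of such an allocation program (the actual
test program is not included in this posting):

	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		size_t size = 4UL << 30;	/* 4GB */
		char *buf = malloc(size);

		if (!buf)
			return 1;
		memset(buf, 1, size);		/* touch every page */
		return 0;
	}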

2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           

For this particular test case, in a guest which doesn't require swap access,
"memhog 6G" execution time lies within a range of 15-30s.
Conclusion -
In the above test case, for an unmodified kernel, on executing memhog in the
third guest the execution time rises above 2 minutes due to swap access.
Using either page hinting or bubble hinting brings this execution time back
into the normal range of 15-30s.

Setup & procedure -
Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
Guest Memory = 6GB
Number of CPUs in the guest = 4
Process = 3 guests are launched one after the other, and the ‘memhog 6G’
execution time is monitored in each of them.
Host swap = 4GB

Performance Analysis:
1. will-it-scale's page_fault1
Setup -
Guest Memory = 6GB
Number of cores = 24

Unmodified kernel -
0,0,100,0,100,0
1,514453,95.84,519502,95.83,519502
2,991485,91.67,932268,91.68,1039004
3,1381237,87.36,1264214,87.64,1558506
4,1789116,83.36,1597767,83.88,2078008
5,2181552,79.20,1889489,80.08,2597510
6,2452416,75.05,2001879,77.10,3117012
7,2671047,70.90,2263866,73.22,3636514
8,2930081,66.75,2333813,70.60,4156016
9,3126431,62.60,2370108,68.28,4675518
10,3211937,58.44,2454093,65.74,5195020
11,3162172,54.32,2450822,63.21,5714522
12,3154261,50.14,2272290,58.98,6234024
13,3115174,46.02,2369679,57.74,6753526
14,3150511,41.86,2470837,54.02,7273028
15,3134158,37.71,2428129,51.98,7792530
16,3143067,33.57,2340469,49.54,8312032
17,3112457,29.43,2263627,44.81,8831534
18,3089724,25.29,2181879,38.69,9351036
19,3076878,21.15,2236505,40.01,9870538
20,3091978,16.95,2266327,35.00,10390040
21,3082927,12.84,2172578,28.12,10909542
22,3055282,8.73,2176269,29.14,11429044
23,3081144,4.56,2138442,24.87,11948546
24,3075509,0.45,2173753,21.62,12468048

page hinting -
0,0,100,0,100,0
1,491683,95.83,494366,95.82,494366
2,988415,91.67,919660,91.68,988732
3,1344829,87.52,1244608,87.69,1483098
4,1797933,83.37,1625797,83.70,1977464
5,2179009,79.21,1881534,80.13,2471830
6,2449858,75.07,2078137,76.82,2966196
7,2732122,70.90,2178105,73.75,3460562
8,2910965,66.75,2340901,70.28,3954928
9,3006665,62.61,2353748,67.91,4449294
10,3164752,58.46,2377936,65.08,4943660
11,3234846,54.32,2510149,63.14,5438026
12,3165477,50.17,2412007,59.91,5932392
13,3141457,46.05,2421548,57.85,6426758
14,3135839,41.90,2378021,53.81,6921124
15,3109113,37.75,2269290,51.76,7415490
16,3093613,33.62,2346185,48.73,7909856
17,3086542,29.49,2352140,46.19,8404222
18,3048991,25.36,2217144,41.52,8898588
19,2965500,21.18,2313614,38.18,9392954
20,2928977,17.05,2175316,35.67,9887320
21,2896667,12.91,2141311,28.90,10381686
22,3047782,8.76,2177664,28.24,10876052
23,2994503,4.58,2160976,22.97,11370418
24,3038762,0.47,2053533,22.39,11864784

bubble-hinting v1 - 
0,0,100,0,100,0
1,515272,95.83,492355,95.81,515272
2,985903,91.66,919653,91.68,1030544
3,1475300,87.51,1353723,87.65,1545816
4,1783938,83.36,1586307,83.78,2061088
5,2093307,79.20,1867395,79.95,2576360
6,2441370,75.05,2055421,76.65,3091632
7,2650471,70.89,2246014,72.93,3606904
8,2926782,66.75,2333601,70.41,4122176
9,3107617,62.60,2383112,68.46,4637448
10,3192332,58.44,2441626,65.84,5152720
11,3268043,54.32,2235964,62.92,5667992
12,3191105,50.18,2449045,60.49,6183264
13,3145317,46.05,2377317,57.80,6698536
14,3161552,41.91,2395814,53.26,7213808
15,3140443,37.77,2333200,51.42,7729080
16,3130866,33.65,2150967,46.11,8244352
17,3112894,29.52,2372068,45.93,8759624
18,3078424,25.39,2336211,39.85,9274896
19,3036457,21.27,2224821,35.25,9790168
20,3046330,17.13,2199755,37.43,10305440
21,2981130,12.98,2214862,28.67,10820712
22,3017481,8.84,2195996,29.69,11335984
23,2979906,4.68,2173395,25.90,11851256
24,2971170,0.52,2134311,21.89,12366528

Conclusion -
For an unmodified kernel, a 3-4% delta from the numbers above is observed in
the results with every fresh boot. For both bubble hinting and page hinting,
no noticeable degradation was observed beyond this expected run-to-run
variability.

Page hinting vs bubble hinting:
From a benefit and performance perspective, both solutions look quite similar
so far. However, bubble hinting is more invasive, while the overall core-mm
changes required for page hinting are minimal.

[1] https://lkml.org/lkml/2019/6/19/926

Nitesh Narayan Lal (2):
  mm: page_hinting: core infrastructure
  virtio-balloon: page_hinting: reporting to the host

 drivers/virtio/Kconfig              |   1 +
 drivers/virtio/virtio_balloon.c     |  90 +++++++++-
 include/linux/page_hinting.h        |  45 +++++
 include/uapi/linux/virtio_balloon.h |  11 ++
 mm/Kconfig                          |   6 +
 mm/Makefile                         |   1 +
 mm/page_alloc.c                     |  18 +-
 mm/page_hinting.c                   | 250 ++++++++++++++++++++++++++++
 8 files changed, 413 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 mm/page_hinting.c

-- 




* [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 19:51 [RFC][PATCH v11 0/2] mm: Support for page hinting Nitesh Narayan Lal
@ 2019-07-10 19:51 ` Nitesh Narayan Lal
  2019-07-10 20:45   ` Dave Hansen
                     ` (2 more replies)
  2019-07-10 19:51 ` [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-10 19:51 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko

This patch introduces the core infrastructure for free page hinting in
virtual environments. It enables the kernel to track free pages which can
then be reported to its hypervisor, so that the hypervisor can free and
reuse that memory as needed.

While the pages are getting processed in the hypervisor (e.g.,
via MADV_FREE), the guest must not use them; otherwise, data loss
would be possible. To avoid such a situation, these pages are
temporarily removed from the buddy. The number of pages removed
temporarily from the buddy is governed by the backend (virtio-balloon
in our case).

To efficiently identify free pages that can be hinted to the
hypervisor, bitmaps in a coarse granularity are used. Only fairly big
chunks are reported to the hypervisor - "MAX_ORDER - 2" on x86 - both to
avoid breaking up THP in the hypervisor and to save space. The bits
in the bitmap are an indication whether a page *might* be free, not a
guarantee. A new hook after buddy merging sets the bits.
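
For example, on x86_64 with 4KB base pages and MAX_ORDER = 11, each bit
covers 2^(MAX_ORDER - 2) = 2^9 base pages, i.e. 2^9 * 4KB = 2MB, matching
the THP size.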

Bitmaps are stored per zone, protected by the zone lock. A workqueue
asynchronously processes the bitmaps, trying to isolate and report pages
that are still free. The backend (virtio-balloon) is responsible for
reporting these batched pages to the host synchronously. Once reporting/
freeing is complete, isolated pages are returned back to the buddy.

There are still various things to look into (e.g., memory hotplug, more
efficient locking, possible races when disabling).

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/linux/page_hinting.h |  45 +++++++
 mm/Kconfig                   |   6 +
 mm/Makefile                  |   1 +
 mm/page_alloc.c              |  18 +--
 mm/page_hinting.c            | 250 +++++++++++++++++++++++++++++++++++
 5 files changed, 312 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 mm/page_hinting.c

diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
new file mode 100644
index 000000000000..4900feb796f9
--- /dev/null
+++ b/include/linux/page_hinting.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_HINTING_H
+#define _LINUX_PAGE_HINTING_H
+
+/*
+ * Minimum page order required for a page to be hinted to the host.
+ */
+#define PAGE_HINTING_MIN_ORDER		(MAX_ORDER - 2)
+
+/*
+ * struct page_hinting_config: holds the information supplied by the balloon
+ * device to page hinting.
+ * @hint_pages:		Callback which reports the isolated pages
+ *			synchronously to the host.
+ * @max_pages:		Maximum pages that are going to be hinted to the host
+ *			at a time of granularity >= PAGE_HINTING_MIN_ORDER.
+ */
+struct page_hinting_config {
+	void (*hint_pages)(struct list_head *list);
+	int max_pages;
+};
+
+extern int __isolate_free_page(struct page *page, unsigned int order);
+extern void __free_one_page(struct page *page, unsigned long pfn,
+			    struct zone *zone, unsigned int order,
+			    int migratetype, bool hint);
+#ifdef CONFIG_PAGE_HINTING
+void page_hinting_enqueue(struct page *page, int order);
+int page_hinting_enable(const struct page_hinting_config *conf);
+void page_hinting_disable(void);
+#else
+static inline void page_hinting_enqueue(struct page *page, int order)
+{
+}
+
+static inline int page_hinting_enable(const struct page_hinting_config *conf)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void page_hinting_disable(void)
+{
+}
+#endif
+#endif /* _LINUX_PAGE_HINTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index f0c76ba47695..e97fab429d9b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -765,4 +765,10 @@ config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+# PAGE_HINTING will allow the guest to report the free pages to the
+# host in fixed chunks as soon as the threshold is reached.
+config PAGE_HINTING
+       bool
+       def_bool n
+       depends on X86_64
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ac5e5ba78874..73be49177656 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_Z3FOLD)	+= z3fold.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8abe0af..8a44338bd04e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -69,6 +69,7 @@
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/page_hinting.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -874,10 +875,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
  * -- nyc
  */
 
-static inline void __free_one_page(struct page *page,
+inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool hint)
 {
 	unsigned long combined_pfn;
 	unsigned long uninitialized_var(buddy_pfn);
@@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page,
 				migratetype);
 	else
 		add_to_free_area(page, &zone->free_area[order], migratetype);
-
+	if (hint)
+		page_hinting_enqueue(page, order);
 }
 
 /*
@@ -1263,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1272,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 static void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
 				unsigned int order,
-				int migratetype)
+				int migratetype, bool hint)
 {
 	spin_lock(&zone->lock);
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, hint);
 	spin_unlock(&zone->lock);
 }
 
@@ -1369,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
 	local_irq_restore(flags);
 }
 
@@ -2969,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype, true);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
diff --git a/mm/page_hinting.c b/mm/page_hinting.c
new file mode 100644
index 000000000000..0bfa09f8c3ed
--- /dev/null
+++ b/mm/page_hinting.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Page hinting core infrastructure to enable a VM to report free pages to its
+ * hypervisor.
+ *
+ * Copyright Red Hat, Inc. 2019
+ *
+ * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/page_hinting.h>
+#include <linux/kvm_host.h>
+
+/*
+ * struct zone_free_area: For a single zone across NUMA nodes, it holds the
+ * bitmap pointer to track the free pages and other required parameters
+ * used to recover these pages by scanning the bitmap.
+ * @bitmap:		Pointer to the bitmap in PAGE_HINTING_MIN_ORDER
+ *			granularity.
+ * @base_pfn:		Starting PFN value for the zone whose bitmap is stored.
+ * @end_pfn:		Indicates the last PFN value for the zone.
+ * @free_pages:		Tracks the number of free pages of granularity
+ *			PAGE_HINTING_MIN_ORDER.
+ * @nbits:		Indicates the total size of the bitmap in bits allocated
+ *			at the time of initialization.
+ */
+struct zone_free_area {
+	unsigned long *bitmap;
+	unsigned long base_pfn;
+	unsigned long end_pfn;
+	atomic_t free_pages;
+	unsigned long nbits;
+} free_area[MAX_NR_ZONES];
+
+static void init_hinting_wq(struct work_struct *work);
+static DEFINE_MUTEX(page_hinting_init);
+const struct page_hinting_config *page_hinting_conf;
+struct work_struct hinting_work;
+atomic_t page_hinting_active;
+
+void free_area_cleanup(int nr_zones)
+{
+	int zone_idx;
+
+	for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) {
+		bitmap_free(free_area[zone_idx].bitmap);
+		free_area[zone_idx].base_pfn = 0;
+		free_area[zone_idx].end_pfn = 0;
+		free_area[zone_idx].nbits = 0;
+		atomic_set(&free_area[zone_idx].free_pages, 0);
+	}
+}
+
+int page_hinting_enable(const struct page_hinting_config *conf)
+{
+	unsigned long bitmap_size = 0;
+	int zone_idx = 0, ret = -EBUSY;
+	struct zone *zone;
+
+	mutex_lock(&page_hinting_init);
+	if (!page_hinting_conf) {
+		for_each_populated_zone(zone) {
+			zone_idx = zone_idx(zone);
+#ifdef CONFIG_ZONE_DEVICE
+			if (zone_idx == ZONE_DEVICE)
+				continue;
+#endif
+			spin_lock(&zone->lock);
+			if (free_area[zone_idx].base_pfn) {
+				free_area[zone_idx].base_pfn =
+					min(free_area[zone_idx].base_pfn,
+					    zone->zone_start_pfn);
+				free_area[zone_idx].end_pfn =
+					max(free_area[zone_idx].end_pfn,
+					    zone->zone_start_pfn +
+					    zone->spanned_pages);
+			} else {
+				free_area[zone_idx].base_pfn =
+					zone->zone_start_pfn;
+				free_area[zone_idx].end_pfn =
+					zone->zone_start_pfn +
+					zone->spanned_pages;
+			}
+			spin_unlock(&zone->lock);
+		}
+
+		for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
+			unsigned long pages = free_area[zone_idx].end_pfn -
+					free_area[zone_idx].base_pfn;
+			bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1;
+			if (!bitmap_size)
+				continue;
+			free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size,
+								   GFP_KERNEL);
+			if (!free_area[zone_idx].bitmap) {
+				free_area_cleanup(zone_idx);
+				mutex_unlock(&page_hinting_init);
+				return -ENOMEM;
+			}
+			free_area[zone_idx].nbits = bitmap_size;
+		}
+		page_hinting_conf = conf;
+		INIT_WORK(&hinting_work, init_hinting_wq);
+		ret = 0;
+	}
+	mutex_unlock(&page_hinting_init);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(page_hinting_enable);
+
+void page_hinting_disable(void)
+{
+	cancel_work_sync(&hinting_work);
+	page_hinting_conf = NULL;
+	free_area_cleanup(MAX_NR_ZONES);
+}
+EXPORT_SYMBOL_GPL(page_hinting_disable);
+
+static unsigned long pfn_to_bit(struct page *page, int zone_idx)
+{
+	unsigned long bitnr;
+
+	bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn)
+			 >> PAGE_HINTING_MIN_ORDER;
+	return bitnr;
+}
+
+static void release_buddy_pages(struct list_head *pages)
+{
+	int mt = 0, zone_idx, order;
+	struct page *page, *next;
+	unsigned long bitnr;
+	struct zone *zone;
+
+	list_for_each_entry_safe(page, next, pages, lru) {
+		zone_idx = page_zonenum(page);
+		zone = page_zone(page);
+		bitnr = pfn_to_bit(page, zone_idx);
+		spin_lock(&zone->lock);
+		list_del(&page->lru);
+		order = page_private(page);
+		set_page_private(page, 0);
+		mt = get_pageblock_migratetype(page);
+		__free_one_page(page, page_to_pfn(page), zone,
+				order, mt, false);
+		spin_unlock(&zone->lock);
+	}
+}
+
+static void bm_set_pfn(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	int zone_idx = page_zonenum(page);
+	unsigned long bitnr = 0;
+
+	lockdep_assert_held(&zone->lock);
+	bitnr = pfn_to_bit(page, zone_idx);
+	/*
+	 * TODO: fix possible underflows.
+	 */
+	if (free_area[zone_idx].bitmap &&
+	    bitnr < free_area[zone_idx].nbits &&
+	    !test_and_set_bit(bitnr, free_area[zone_idx].bitmap))
+		atomic_inc(&free_area[zone_idx].free_pages);
+}
+
+static void scan_zone_free_area(int zone_idx, int free_pages)
+{
+	int ret = 0, order, isolated_cnt = 0;
+	unsigned long set_bit, start = 0;
+	LIST_HEAD(isolated_pages);
+	struct page *page;
+	struct zone *zone;
+
+	for (;;) {
+		ret = 0;
+		set_bit = find_next_bit(free_area[zone_idx].bitmap,
+					free_area[zone_idx].nbits, start);
+		if (set_bit >= free_area[zone_idx].nbits)
+			break;
+		start = set_bit + 1;
+		page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
+				free_area[zone_idx].base_pfn);
+		if (!page)
+			continue;
+		zone = page_zone(page);
+		spin_lock(&zone->lock);
+
+		if (PageBuddy(page) && page_private(page) >=
+		    PAGE_HINTING_MIN_ORDER) {
+			order = page_private(page);
+			ret = __isolate_free_page(page, order);
+		}
+		clear_bit(set_bit, free_area[zone_idx].bitmap);
+		atomic_dec(&free_area[zone_idx].free_pages);
+		spin_unlock(&zone->lock);
+		if (ret) {
+			/*
+			 * restoring page order to use it while releasing
+			 * the pages back to the buddy.
+			 */
+			set_page_private(page, order);
+			list_add_tail(&page->lru, &isolated_pages);
+			isolated_cnt++;
+			if (isolated_cnt == page_hinting_conf->max_pages) {
+				page_hinting_conf->hint_pages(&isolated_pages);
+				release_buddy_pages(&isolated_pages);
+				isolated_cnt = 0;
+			}
+		}
+	}
+	if (isolated_cnt) {
+		page_hinting_conf->hint_pages(&isolated_pages);
+		release_buddy_pages(&isolated_pages);
+	}
+}
+
+static void init_hinting_wq(struct work_struct *work)
+{
+	int zone_idx, free_pages;
+
+	atomic_set(&page_hinting_active, 1);
+	for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
+		free_pages = atomic_read(&free_area[zone_idx].free_pages);
+		if (free_pages >= page_hinting_conf->max_pages)
+			scan_zone_free_area(zone_idx, free_pages);
+	}
+	atomic_set(&page_hinting_active, 0);
+}
+
+void page_hinting_enqueue(struct page *page, int order)
+{
+	int zone_idx;
+
+	if (!page_hinting_conf || order < PAGE_HINTING_MIN_ORDER)
+		return;
+
+	bm_set_pfn(page);
+	if (atomic_read(&page_hinting_active))
+		return;
+	zone_idx = zone_idx(page_zone(page));
+	if (atomic_read(&free_area[zone_idx].free_pages) >=
+			page_hinting_conf->max_pages) {
+		int cpu = smp_processor_id();
+
+		queue_work_on(cpu, system_wq, &hinting_work);
+	}
+}
-- 
2.21.0



* [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-07-10 19:51 [RFC][PATCH v11 0/2] mm: Support for page hinting Nitesh Narayan Lal
  2019-07-10 19:51 ` [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
@ 2019-07-10 19:51 ` Nitesh Narayan Lal
  2019-07-24 19:47   ` Michael S. Tsirkin
  2019-07-10 19:53 ` [QEMU Patch] virtio-balloon: Support for page hinting Nitesh Narayan Lal
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-10 19:51 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko

Enables the kernel to negotiate the VIRTIO_BALLOON_F_HINTING feature with the
host. If it is available and page_hinting_flag is set to true, page hinting
is enabled and its callbacks are configured along with the max_pages count,
which indicates the maximum number of pages that can be isolated and hinted
at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are
reported. To prevent any false OOM, the max_pages count is set to 16.

By default, the page_hinting feature is enabled and is initialized as soon
as the virtio-balloon driver is loaded. However, it can be disabled by
setting the page_hinting_flag virtio-balloon module parameter at load time
(see the example below).
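
For example (assuming the balloon driver is built as the virtio_balloon
module):

  # modprobe virtio_balloon page_hinting_flag=0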

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/virtio/Kconfig              |  1 +
 drivers/virtio/virtio_balloon.c     | 90 +++++++++++++++++++++++++++-
 include/uapi/linux/virtio_balloon.h | 11 ++++
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 023fc3bc01c6..dcc0cb4269a5 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -47,6 +47,7 @@ config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
 	select MEMORY_BALLOON
+	select PAGE_HINTING
 	---help---
 	 This driver supports increasing and decreasing the amount
 	 of memory within a KVM guest.
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 44339fc87cc7..1fb0eb0b2c20 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -18,6 +18,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/page_hinting.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -35,6 +36,12 @@
 /* The size of a free page block in bytes */
 #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
 	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
+/* Number of isolated pages to be reported to the host at a time.
+ * TODO:
+ * 1. Set it via host.
+ * 2. Find an optimal value for this.
+ */
+#define PAGE_HINTING_MAX_PAGES	16
 
 #ifdef CONFIG_BALLOON_COMPACTION
 static struct vfsmount *balloon_mnt;
@@ -45,6 +52,7 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_DEFLATE,
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_HINTING,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -54,7 +62,8 @@ enum virtio_balloon_config_read {
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
+			 *hinting_vq;
 
 	/* Balloon's own wq for cpu-intensive work items */
 	struct workqueue_struct *balloon_wq;
@@ -112,6 +121,9 @@ struct virtio_balloon {
 
 	/* To register a shrinker to shrink memory upon memory pressure */
 	struct shrinker shrinker;
+
+	/* Array object pointing at the isolated pages ready for hinting */
+	struct isolated_memory isolated_pages[PAGE_HINTING_MAX_PAGES];
 };
 
 static struct virtio_device_id id_table[] = {
@@ -119,6 +131,66 @@ static struct virtio_device_id id_table[] = {
 	{ 0 },
 };
 
+static struct page_hinting_config page_hinting_conf;
+bool page_hinting_flag = true;
+struct virtio_balloon *hvb;
+module_param(page_hinting_flag, bool, 0444);
+MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
+
+static int page_hinting_report(void)
+{
+	struct virtqueue *vq = hvb->hinting_vq;
+	struct scatterlist sg;
+	int err = 0, unused;
+
+	mutex_lock(&hvb->balloon_lock);
+	sg_init_one(&sg, hvb->isolated_pages, sizeof(hvb->isolated_pages[0]) *
+		    PAGE_HINTING_MAX_PAGES);
+	err = virtqueue_add_outbuf(vq, &sg, 1, hvb, GFP_KERNEL);
+	if (!err)
+		virtqueue_kick(hvb->hinting_vq);
+	wait_event(hvb->acked, virtqueue_get_buf(vq, &unused));
+	mutex_unlock(&hvb->balloon_lock);
+	return err;
+}
+
+void hint_pages(struct list_head *pages)
+{
+	struct device *dev = &hvb->vdev->dev;
+	struct page *page, *next;
+	int idx = 0, order, err;
+	unsigned long pfn;
+
+	list_for_each_entry_safe(page, next, pages, lru) {
+		pfn = page_to_pfn(page);
+		order = page_private(page);
+		hvb->isolated_pages[idx].phys_addr = pfn << PAGE_SHIFT;
+		hvb->isolated_pages[idx].size = (1 << order) * PAGE_SIZE;
+		idx++;
+	}
+	err = page_hinting_report();
+	if (err < 0)
+		dev_err(dev, "Failed to hint pages, err = %d\n", err);
+}
+
+static void page_hinting_init(struct virtio_balloon *vb)
+{
+	struct device *dev = &vb->vdev->dev;
+	int err;
+
+	page_hinting_conf.hint_pages = hint_pages;
+	page_hinting_conf.max_pages = PAGE_HINTING_MAX_PAGES;
+	err = page_hinting_enable(&page_hinting_conf);
+	if (err < 0) {
+		dev_err(dev, "Failed to enable page-hinting, err = %d\n", err);
+		page_hinting_flag = false;
+		page_hinting_conf.hint_pages = NULL;
+		page_hinting_conf.max_pages = 0;
+		return;
+	}
+	hvb = vb;
+}
+
 static u32 page_to_balloon_pfn(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -475,6 +547,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
 	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -486,11 +559,18 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
+		callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
+	}
 	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 					 vqs, callbacks, names, NULL, NULL);
 	if (err)
 		return err;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
+
 	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
 	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -929,6 +1009,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		if (err)
 			goto out_del_balloon_wq;
 	}
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
+	    page_hinting_flag)
+		page_hinting_init(vb);
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
@@ -976,6 +1059,10 @@ static void virtballoon_remove(struct virtio_device *vdev)
 		destroy_workqueue(vb->balloon_wq);
 	}
 
+	if (hvb) {
+		page_hinting_disable();
+		hvb = NULL;
+	}
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
@@ -1030,8 +1117,9 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
+	VIRTIO_BALLOON_F_HINTING,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..29eed0ec83d3 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,8 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+/* TODO: Find a better name to avoid any confusion with FREE_PAGE_HINT */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -108,4 +110,13 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+/*
+ * struct isolated_memory - holds the pages which will be reported to the host.
+ * @phys_addr:	physical address associated with a page.
+ * @size:	total size of memory to be reported.
+ */
+struct isolated_memory {
+	__virtio64 phys_addr;
+	__virtio64 size;
+};
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.21.0



* [QEMU Patch] virtio-balloon: Support for page hinting
  2019-07-10 19:51 [RFC][PATCH v11 0/2] mm: Support for page hinting Nitesh Narayan Lal
  2019-07-10 19:51 ` [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
  2019-07-10 19:51 ` [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
@ 2019-07-10 19:53 ` Nitesh Narayan Lal
  2019-07-10 20:17   ` Alexander Duyck
                     ` (2 more replies)
  2019-07-10 20:19 ` [RFC][PATCH v11 0/2] mm: " Dave Hansen
  2019-07-10 23:40 ` Alexander Duyck
  4 siblings, 3 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-10 19:53 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko

Enables QEMU to perform MADV_FREE on the memory ranges reported
by the VM.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 hw/virtio/trace-events                        |  1 +
 hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
 include/hw/virtio/virtio-balloon.h            |  2 +-
 include/qemu/osdep.h                          |  7 +++
 .../standard-headers/linux/virtio_balloon.h   |  1 +
 5 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index e28ba48da6..f703a22d36 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
 virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
 virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
 virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
+virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request PFN:%lu size: %d"
 
 # virtio-mmio.c
 virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 2112874055..5d186707b5 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -34,6 +34,9 @@
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
+#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES	16
+void free_mem_range(uint64_t addr, uint64_t len);
+
 struct PartiallyBalloonedPage {
     RAMBlock *rb;
     ram_addr_t base;
@@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+void free_mem_range(uint64_t addr, uint64_t len)
+{
+    int ret = 0;
+    void *hvaddr_to_free;
+    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
+                                                 addr, 1);
+    if (!mrs.mr) {
+        warn_report("%s: No memory is mapped at address 0x%" PRIx64, __func__, addr);
+        return;
+    }
+
+    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
+        warn_report("%s: Memory at address 0x%" PRIx64 " is not RAM",
+                    __func__, addr);
+        memory_region_unref(mrs.mr);
+        return;
+    }
+
+    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
+    trace_virtio_balloon_hinting_request(addr, len);
+    ret = qemu_madvise(hvaddr_to_free, len, QEMU_MADV_FREE);
+    if (ret == -1) {
+        warn_report("%s: madvise failed with error: %d", __func__, ret);
+    }
+}
+
+static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
+					       VirtQueue *vq)
+{
+    VirtQueueElement *elem;
+    size_t offset = 0;
+    uint64_t gpa, len;
+    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
+        return;
+    }
+    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
+     * only read the buffer which holds a valid PFN value.
+     * TODO: Find a better way to do this.
+     */
+    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
+	offset += 8;
+	offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);
+	if (!qemu_balloon_is_inhibited()) {
+	    free_mem_range(gpa, len);
+	}
+    }
+    virtqueue_push(vq, elem, offset);
+    virtio_notify(vdev, vq);
+    g_free(elem);
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
     VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
     f |= dev->host_features;
     virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
 
     return f;
 }
@@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);
 
     if (virtio_has_feature(s->host_features,
                            VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
@@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
 
     object_property_add(obj, "guest-stats", "guest statistics",
                         balloon_stats_get_all, NULL, NULL, s, NULL);
+    object_property_add(obj, "guest-page-hinting", "guest page hinting",
+                        NULL, NULL, NULL, s, NULL);
 
     object_property_add(obj, "guest-stats-polling-interval", "int",
                         balloon_stats_get_poll_interval,
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 1afafb12f6..a58b24fdf2 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
+    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *hvq;
     uint32_t free_page_report_status;
     uint32_t num_pages;
     uint32_t actual;
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index af2b91f0b8..bb9207e7f4 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
 #endif
+#ifdef MADV_FREE
+#define QEMU_MADV_FREE MADV_FREE
+#else
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_FREE QEMU_MADV_INVALID
 
 #endif
 
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 9375ca2a70..f9e3e82562 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.21.0



* Re: [QEMU Patch] virtio-balloon: Support for page hinting
  2019-07-10 19:53 ` [QEMU Patch] virtio-balloon: Support for page hinting Nitesh Narayan Lal
@ 2019-07-10 20:17   ` Alexander Duyck
  2019-07-11 12:03     ` Nitesh Narayan Lal
  2019-07-11  8:49   ` Cornelia Huck
  2019-07-11 18:55   ` Michael S. Tsirkin
  2 siblings, 1 reply; 43+ messages in thread
From: Alexander Duyck @ 2019-07-10 20:17 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Wed, Jul 10, 2019 at 12:53 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> Enables QEMU to perform MADV_FREE on the memory ranges reported
> by the VM.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  hw/virtio/trace-events                        |  1 +
>  hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
>  include/hw/virtio/virtio-balloon.h            |  2 +-
>  include/qemu/osdep.h                          |  7 +++
>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>  5 files changed, 69 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index e28ba48da6..f703a22d36 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request PFN:%lu size: %d"
>
>  # virtio-mmio.c
>  virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 2112874055..5d186707b5 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -34,6 +34,9 @@
>
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>
> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES  16
> +void free_mem_range(uint64_t addr, uint64_t len);
> +

The definition you have here is unused. I think you can drop it. Also
why do you need this forward declaration? Couldn't you just leave
free_mem_range below as a static and still have this compile?
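
That is, simply:

static void free_mem_range(uint64_t addr, uint64_t len)
{
    ...
}

with no separate prototype needed.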

>  struct PartiallyBalloonedPage {
>      RAMBlock *rb;
>      ram_addr_t base;
> @@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>      balloon_stats_change_timer(s, 0);
>  }
>
> +void free_mem_range(uint64_t addr, uint64_t len)
> +{
> +    int ret = 0;
> +    void *hvaddr_to_free;
> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
> +                                                 addr, 1);
> +    if (!mrs.mr) {
> +        warn_report("%s: No memory is mapped at address 0x%" PRIx64, __func__, addr);
> +        return;
> +    }
> +
> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
> +        warn_report("%s: Memory at address 0x%" PRIx64 " is not RAM",
> +                    __func__, addr);
> +        memory_region_unref(mrs.mr);
> +        return;
> +    }
> +
> +    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
> +    trace_virtio_balloon_hinting_request(addr, len);
> +    ret = qemu_madvise(hvaddr_to_free, len, QEMU_MADV_FREE);
> +    if (ret == -1) {
> +        warn_report("%s: madvise failed with error: %d", __func__, ret);
> +    }
> +}
> +
> +static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
> +                                              VirtQueue *vq)
> +{
> +    VirtQueueElement *elem;
> +    size_t offset = 0;
> +    uint64_t gpa, len;
> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +    if (!elem) {
> +        return;
> +    }
> +    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
> +     * only read the buffer which holds a valid PFN value.
> +     * TODO: Find a better way to do this.
> +     */

I'm not sure this comment makes much sense to me. Shouldn't the
iov_to_buf be limiting you anyway? Why do you need the additional gpa
check?

> +    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
> +       offset += 8;
> +       offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);

Why pull this out as two separate buffers? Why not just define a
structure that consists of the two uint64_t values and then pull the
entire thing as one buffer? I'm pretty sure the solution as you have
it now opens you up to an error since you could have a malicious guest
only give you a part of the structure and you really should be
verifying you get the entire structure.
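
Something like this (struct and field names just illustrative):

struct hint_entry {
    uint64_t gpa;
    uint64_t len;
};
struct hint_entry ent;

while (iov_to_buf(elem->out_sg, elem->out_num, offset,
                  &ent, sizeof(ent)) == sizeof(ent)) {
    offset += sizeof(ent);
    if (!qemu_balloon_is_inhibited()) {
        free_mem_range(ent.gpa, ent.len);
    }
}

That way a truncated entry fails the size check and is dropped instead of
being half-processed.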

> +       if (!qemu_balloon_is_inhibited()) {
> +           free_mem_range(gpa, len);
> +       }
> +    }
> +    virtqueue_push(vq, elem, offset);
> +    virtio_notify(vdev, vq);
> +    g_free(elem);
> +}
> +
>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> @@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>      f |= dev->host_features;
>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
>
>      return f;
>  }
> @@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);
>
>      if (virtio_has_feature(s->host_features,
>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> @@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
>
>      object_property_add(obj, "guest-stats", "guest statistics",
>                          balloon_stats_get_all, NULL, NULL, s, NULL);
> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
> +                        NULL, NULL, NULL, s, NULL);
>
>      object_property_add(obj, "guest-stats-polling-interval", "int",
>                          balloon_stats_get_poll_interval,
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 1afafb12f6..a58b24fdf2 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {
>
>  typedef struct VirtIOBalloon {
>      VirtIODevice parent_obj;
> -    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
> +    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *hvq;
>      uint32_t free_page_report_status;
>      uint32_t num_pages;
>      uint32_t actual;
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index af2b91f0b8..bb9207e7f4 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #else
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>  #endif
> +#ifdef MADV_FREE
> +#define QEMU_MADV_FREE MADV_FREE
> +#else
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> +#endif

As I mentioned before it might make more sense to use MADV_DONTNEED
instead of just disabling this functionality if the host kernel
doesn't have MADV_FREE support. That way you would still have the
functionality on kernels prior to 4.5 if they need it.
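
I.e. something along the lines of:

#ifdef MADV_FREE
#define QEMU_MADV_FREE MADV_FREE
#elif defined(MADV_DONTNEED)
#define QEMU_MADV_FREE MADV_DONTNEED
#else
#define QEMU_MADV_FREE QEMU_MADV_INVALID
#endif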

>  #elif defined(CONFIG_POSIX_MADVISE)
>
> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID

Same here. It might make more sense to use the POSIX_MADV_DONTNEED
instead of just making it invalid.

>  #else /* no-op */
>
> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>
>  #endif
>
> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> index 9375ca2a70..f9e3e82562 100644
> --- a/include/standard-headers/linux/virtio_balloon.h
> +++ b/include/standard-headers/linux/virtio_balloon.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT        3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
>
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> --
> 2.21.0
>


* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-10 19:51 [RFC][PATCH v11 0/2] mm: Support for page hinting Nitesh Narayan Lal
                   ` (2 preceding siblings ...)
  2019-07-10 19:53 ` [QEMU Patch] virtio-balloon: Support for page hinting Nitesh Narayan Lal
@ 2019-07-10 20:19 ` Dave Hansen
  2019-07-11 11:37   ` Nitesh Narayan Lal
  2019-07-10 23:40 ` Alexander Duyck
  4 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2019-07-10 20:19 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
> This patch series proposes an efficient mechanism for reporting free memory
> from a guest to its hypervisor. It especially enables guests with no page cache
> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> rapidly hand back free memory to the hypervisor.
> This approach has a minimal impact on the existing core-mm infrastructure.
> 
> Measurement results (measurement details appended to this email):
> *Number of 5GB guests (each touching 4GB memory) that can be launched
> without swap usage on a system with 15GB:

This sounds like a reasonable measurement, but I think you're missing a
sentence or two explaining why this test was used.

> unmodified kernel - 2, 3rd with 2.5GB   

What does "3rd with 2.5GB" mean?  The third gets 2.5GB before failing an
allocation and crashing?

> v11 page hinting - 6, 7th with 26MB    
> v1 bubble hinting[1] - 6, 7th with 1.8GB (bubble hinting is another series
> proposed to solve the same problems)

Could you please make an effort to format things so that reviewers can
easily read them?  Aligning columns and using common units would be very
helpful, for instance:

     unmodified kernel - 2, 3rd with 2.50 GB
      v11 page hinting - 6, 7th with 0.03 GB
  v1 bubble hinting[1] - 6, 7th with 1.80 GB

See how you can scan that easily and compare between the rows?

I think you did some analysis below.  But, that seems misplaced.  It's
better to include the conclusion here and the details to back it up
later.  As it stands, the cover letter just throws some data at a
reviewer and hopes they can make sense of it.

> *Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           

Again, I'm finding myself having to reformat your data just so I can
make sense of it.  You also forgot the unit for Guest 1 in row 3.

   unmodified - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB

  v11 hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
  v1 bubble   - Guest1:23s, Guest2:11s, Guest3:26s swap used = 0

So, what is this supposed to show?  What does it mean?  Why do the
numbers vary *so* much?


* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 19:51 ` [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
@ 2019-07-10 20:45   ` Dave Hansen
  2019-07-11 11:48     ` Nitesh Narayan Lal
                       ` (2 more replies)
  2019-07-10 21:56   ` Alexander Duyck
  2019-07-11 18:21   ` Dave Hansen
  2 siblings, 3 replies; 43+ messages in thread
From: Dave Hansen @ 2019-07-10 20:45 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
> +struct zone_free_area {
> +	unsigned long *bitmap;
> +	unsigned long base_pfn;
> +	unsigned long end_pfn;
> +	atomic_t free_pages;
> +	unsigned long nbits;
> +} free_area[MAX_NR_ZONES];

Why do we need an extra data structure?  What's wrong with putting
per-zone data in ... 'struct zone'?  The cover letter claims that it
doesn't touch core-mm infrastructure, but if it depends on mechanisms
like this, I think that's a very bad thing.

To be honest, I'm not sure this series is worth reviewing at this point.
 It's horribly lightly commented and full of kernel antipatterns like

void func()
{
	if () {
		... indent entire logic
		... of function
	}
}
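
rather than the usual early-return form:

void func()
{
	if (!something)
		return;
	... logic of function at one indent level
}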

It has big "TODO"s.  It's virtually comment-free.  I'm shocked it's at
the 11th version and still looking like this.

> +
> +		for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
> +			unsigned long pages = free_area[zone_idx].end_pfn -
> +					free_area[zone_idx].base_pfn;
> +			bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1;
> +			if (!bitmap_size)
> +				continue;
> +			free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size,
> +								   GFP_KERNEL);

This doesn't support sparse zones.  We can have zones with massive
spanned page sizes, but very few present pages.  On those zones, this
will exhaust memory for no good reason.

Comparing this to Alex's patch set, it's of much lower quality and at a
much earlier stage of development.  The two sets are not really even
comparable right now.  This certainly doesn't sell me on (or even really
enumerate the deltas in) this approach vs. Alex's.



* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 19:51 ` [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
  2019-07-10 20:45   ` Dave Hansen
@ 2019-07-10 21:56   ` Alexander Duyck
  2019-07-11 17:58     ` Nitesh Narayan Lal
  2019-07-11 18:21   ` Dave Hansen
  2 siblings, 1 reply; 43+ messages in thread
From: Alexander Duyck @ 2019-07-10 21:56 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> This patch introduces the core infrastructure for free page hinting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor can
> free and reuse that memory as per its requirements.
>
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_FREE), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend (virtio-balloon
> in our case).
>
> To efficiently identify free pages that can be hinted to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
>
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
>
> There are still various things to look into (e.g., memory hotplug, more
> efficient locking, possible races when disabling).
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  include/linux/page_hinting.h |  45 +++++++
>  mm/Kconfig                   |   6 +
>  mm/Makefile                  |   1 +
>  mm/page_alloc.c              |  18 +--
>  mm/page_hinting.c            | 250 +++++++++++++++++++++++++++++++++++
>  5 files changed, 312 insertions(+), 8 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
>
> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> new file mode 100644
> index 000000000000..4900feb796f9
> --- /dev/null
> +++ b/include/linux/page_hinting.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_HINTING_H
> +#define _LINUX_PAGE_HINTING_H
> +
> +/*
> + * Minimum page order required for a page to be hinted to the host.
> + */
> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
> +

Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed
out earlier, or is it due to something else? I'm just wondering if this
will have an impact on architectures outside of x86. I had chosen
pageblock_order, which happened to be MAX_ORDER - 2 on x86, but I don't
know what the impact of doing that is on other architectures versus the
(MAX_ORDER - 2) approach you took here.
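
For reference, on x86-64 with 4KB pages MAX_ORDER is 11, so
(MAX_ORDER - 2) works out to order 9 = 512 pages = 2MB, i.e. the
THP/pageblock size there (assuming the usual defaults). The two choices
only diverge on architectures where pageblock_order is configured
differently.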

> +/*
> + * struct page_hinting_config: holds the information supplied by the balloon
> + * device to page hinting.
> + * @hint_pages:                Callback which reports the isolated pages
> + *                     synchronously to the host.
> + * @max_pages:         Maximum pages that are going to be hinted to the host
> + *                     at a time of granularity >= PAGE_HINTING_MIN_ORDER.
> + */
> +struct page_hinting_config {
> +       void (*hint_pages)(struct list_head *list);
> +       int max_pages;
> +};
> +
> +extern int __isolate_free_page(struct page *page, unsigned int order);
> +extern void __free_one_page(struct page *page, unsigned long pfn,
> +                           struct zone *zone, unsigned int order,
> +                           int migratetype, bool hint);
> +#ifdef CONFIG_PAGE_HINTING
> +void page_hinting_enqueue(struct page *page, int order);
> +int page_hinting_enable(const struct page_hinting_config *conf);
> +void page_hinting_disable(void);
> +#else
> +static inline void page_hinting_enqueue(struct page *page, int order)
> +{
> +}
> +
> +static inline int page_hinting_enable(const struct page_hinting_config *conf)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static inline void page_hinting_disable(void)
> +{
> +}
> +#endif
> +#endif /* _LINUX_PAGE_HINTING_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index f0c76ba47695..e97fab429d9b 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -765,4 +765,10 @@ config GUP_BENCHMARK
>  config ARCH_HAS_PTE_SPECIAL
>         bool
>
> +# PAGE_HINTING will allow the guest to report the free pages to the
> +# host in fixed chunks as soon as the threshold is reached.
> +config PAGE_HINTING
> +       bool
> +       def_bool n
> +       depends on X86_64
>  endmenu

If there are no issues with using the term "PAGE_HINTING", I guess I
will update my patch set to use that term instead of aeration.

> diff --git a/mm/Makefile b/mm/Makefile
> index ac5e5ba78874..73be49177656 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -94,6 +94,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)      += cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d66bc8abe0af..8a44338bd04e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -69,6 +69,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/page_hinting.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -874,10 +875,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>   * -- nyc
>   */
>
> -static inline void __free_one_page(struct page *page,
> +inline void __free_one_page(struct page *page,
>                 unsigned long pfn,
>                 struct zone *zone, unsigned int order,
> -               int migratetype)
> +               int migratetype, bool hint)
>  {
>         unsigned long combined_pfn;
>         unsigned long uninitialized_var(buddy_pfn);
> @@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page,
>                                 migratetype);
>         else
>                 add_to_free_area(page, &zone->free_area[order], migratetype);
> -
> +       if (hint)
> +               page_hinting_enqueue(page, order);
>  }

I'm not sure I am a fan of the way the word "hint" is used here. At
first I thought this was supposed to be !hint since I thought hint
meant that it was a hinted page, not that we need to record that this
page has been freed. Maybe "record" or "report" would be a better word
to use here.

>  /*
> @@ -1263,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>                 if (unlikely(isolated_pageblocks))
>                         mt = get_pageblock_migratetype(page);
>
> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>                 trace_mm_page_pcpu_drain(page, 0, mt);
>         }
>         spin_unlock(&zone->lock);
> @@ -1272,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>                                 struct page *page, unsigned long pfn,
>                                 unsigned int order,
> -                               int migratetype)
> +                               int migratetype, bool hint)
>  {
>         spin_lock(&zone->lock);
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
>                 migratetype = get_pfnblock_migratetype(page, pfn);
>         }
> -       __free_one_page(page, pfn, zone, order, migratetype);
> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>         spin_unlock(&zone->lock);
>  }
>
> @@ -1369,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>         migratetype = get_pfnblock_migratetype(page, pfn);
>         local_irq_save(flags);
>         __count_vm_events(PGFREE, 1 << order);
> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>         local_irq_restore(flags);
>  }
>
> @@ -2969,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>          */
>         if (migratetype >= MIGRATE_PCPTYPES) {
>                 if (unlikely(is_migrate_isolate(migratetype))) {
> -                       free_one_page(zone, page, pfn, 0, migratetype);
> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>                         return;
>                 }
>                 migratetype = MIGRATE_MOVABLE;
> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> new file mode 100644
> index 000000000000..0bfa09f8c3ed
> --- /dev/null
> +++ b/mm/page_hinting.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Page hinting core infrastructure to enable a VM to report free pages to its
> + * hypervisor.
> + *
> + * Copyright Red Hat, Inc. 2019
> + *
> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/page_hinting.h>
> +#include <linux/kvm_host.h>
> +
> +/*
> + * struct zone_free_area: For a single zone across NUMA nodes, it holds the
> + * bitmap pointer to track the free pages and other required parameters
> + * used to recover these pages by scanning the bitmap.
> + * @bitmap:            Pointer to the bitmap in PAGE_HINTING_MIN_ORDER
> + *                     granularity.
> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
> + * @end_pfn:           Indicates the last PFN value for the zone.
> + * @free_pages:                Tracks the number of free pages of granularity
> + *                     PAGE_HINTING_MIN_ORDER.
> + * @nbits:             Indicates the total size of the bitmap in bits allocated
> + *                     at the time of initialization.
> + */
> +struct zone_free_area {
> +       unsigned long *bitmap;
> +       unsigned long base_pfn;
> +       unsigned long end_pfn;
> +       atomic_t free_pages;
> +       unsigned long nbits;
> +} free_area[MAX_NR_ZONES];
> +

You still haven't addressed the NUMA issue I pointed out with v10. You
are only able to address the first set of zones with this setup. As
such you can end up missing large sections of memory if it is split
over multiple nodes.

> +static void init_hinting_wq(struct work_struct *work);
> +static DEFINE_MUTEX(page_hinting_init);
> +const struct page_hinting_config *page_hitning_conf;
> +struct work_struct hinting_work;
> +atomic_t page_hinting_active;
> +
> +void free_area_cleanup(int nr_zones)
> +{

I'm not sure why you are passing nr_zones as an argument here. Won't
this always be MAX_NR_ZONES?

> +       int zone_idx;
> +
> +       for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) {
> +               bitmap_free(free_area[zone_idx].bitmap);
> +               free_area[zone_idx].base_pfn = 0;
> +               free_area[zone_idx].end_pfn = 0;
> +               free_area[zone_idx].nbits = 0;
> +               atomic_set(&free_area[zone_idx].free_pages, 0);
> +       }
> +}
> +
> +int page_hinting_enable(const struct page_hinting_config *conf)
> +{
> +       unsigned long bitmap_size = 0;
> +       int zone_idx = 0, ret = -EBUSY;
> +       struct zone *zone;
> +
> +       mutex_lock(&page_hinting_init);
> +       if (!page_hitning_conf) {
> +               for_each_populated_zone(zone) {

So for_each_populated_zone will go through all of the NUMA nodes. So
if I am not mistaken you will overwrite the free_area values of all
the previous nodes with the last node in the system. So if we have a
setup that has all the memory in the first node, and none in the
second, it would effectively disable free page hinting, would it not?

> +                       zone_idx = zone_idx(zone);
> +#ifdef CONFIG_ZONE_DEVICE
> +                       if (zone_idx == ZONE_DEVICE)
> +                               continue;
> +#endif
> +                       spin_lock(&zone->lock);
> +                       if (free_area[zone_idx].base_pfn) {
> +                               free_area[zone_idx].base_pfn =
> +                                       min(free_area[zone_idx].base_pfn,
> +                                           zone->zone_start_pfn);
> +                               free_area[zone_idx].end_pfn =
> +                                       max(free_area[zone_idx].end_pfn,
> +                                           zone->zone_start_pfn +
> +                                           zone->spanned_pages);
> +                       } else {
> +                               free_area[zone_idx].base_pfn =
> +                                       zone->zone_start_pfn;
> +                               free_area[zone_idx].end_pfn =
> +                                       zone->zone_start_pfn +
> +                                       zone->spanned_pages;
> +                       }
> +                       spin_unlock(&zone->lock);
> +               }
> +
> +               for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
> +                       unsigned long pages = free_area[zone_idx].end_pfn -
> +                                       free_area[zone_idx].base_pfn;
> +                       bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1;
> +                       if (!bitmap_size)
> +                               continue;
> +                       free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size,
> +                                                                  GFP_KERNEL);
> +                       if (!free_area[zone_idx].bitmap) {
> +                               free_area_cleanup(zone_idx);
> +                               mutex_unlock(&page_hinting_init);
> +                               return -ENOMEM;
> +                       }
> +                       free_area[zone_idx].nbits = bitmap_size;
> +               }

So this is the bit that still needs to address hotplug, right? I would
imagine you need to reallocate this if the spanned_pages range changes,
correct?

> +               page_hitning_conf = conf;
> +               INIT_WORK(&hinting_work, init_hinting_wq);
> +               ret = 0;
> +       }
> +       mutex_unlock(&page_hinting_init);
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_enable);
> +
> +void page_hinting_disable(void)
> +{
> +       cancel_work_sync(&hinting_work);
> +       page_hitning_conf = NULL;
> +       free_area_cleanup(MAX_NR_ZONES);
> +}
> +EXPORT_SYMBOL_GPL(page_hinting_disable);
> +
> +static unsigned long pfn_to_bit(struct page *page, int zone_idx)
> +{
> +       unsigned long bitnr;
> +
> +       bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn)
> +                        >> PAGE_HINTING_MIN_ORDER;
> +       return bitnr;
> +}
> +
> +static void release_buddy_pages(struct list_head *pages)
> +{
> +       int mt = 0, zone_idx, order;
> +       struct page *page, *next;
> +       unsigned long bitnr;
> +       struct zone *zone;
> +
> +       list_for_each_entry_safe(page, next, pages, lru) {
> +               zone_idx = page_zonenum(page);
> +               zone = page_zone(page);
> +               bitnr = pfn_to_bit(page, zone_idx);
> +               spin_lock(&zone->lock);
> +               list_del(&page->lru);
> +               order = page_private(page);
> +               set_page_private(page, 0);
> +               mt = get_pageblock_migratetype(page);
> +               __free_one_page(page, page_to_pfn(page), zone,
> +                               order, mt, false);
> +               spin_unlock(&zone->lock);
> +       }
> +}
> +
> +static void bm_set_pfn(struct page *page)
> +{
> +       struct zone *zone = page_zone(page);
> +       int zone_idx = page_zonenum(page);
> +       unsigned long bitnr = 0;
> +
> +       lockdep_assert_held(&zone->lock);
> +       bitnr = pfn_to_bit(page, zone_idx);
> +       /*
> +        * TODO: fix possible underflows.
> +        */
> +       if (free_area[zone_idx].bitmap &&
> +           bitnr < free_area[zone_idx].nbits &&
> +           !test_and_set_bit(bitnr, free_area[zone_idx].bitmap))
> +               atomic_inc(&free_area[zone_idx].free_pages);
> +}
> +
> +static void scan_zone_free_area(int zone_idx, int free_pages)
> +{
> +       int ret = 0, order, isolated_cnt = 0;
> +       unsigned long set_bit, start = 0;
> +       LIST_HEAD(isolated_pages);
> +       struct page *page;
> +       struct zone *zone;
> +
> +       for (;;) {
> +               ret = 0;
> +               set_bit = find_next_bit(free_area[zone_idx].bitmap,
> +                                       free_area[zone_idx].nbits, start);
> +               if (set_bit >= free_area[zone_idx].nbits)
> +                       break;
> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> +                               free_area[zone_idx].base_pfn);
> +               if (!page)
> +                       continue;
> +               zone = page_zone(page);
> +               spin_lock(&zone->lock);
> +
> +               if (PageBuddy(page) && page_private(page) >=
> +                   PAGE_HINTING_MIN_ORDER) {
> +                       order = page_private(page);
> +                       ret = __isolate_free_page(page, order);
> +               }
> +               clear_bit(set_bit, free_area[zone_idx].bitmap);
> +               atomic_dec(&free_area[zone_idx].free_pages);
> +               spin_unlock(&zone->lock);
> +               if (ret) {
> +                       /*
> +                        * restoring page order to use it while releasing
> +                        * the pages back to the buddy.
> +                        */
> +                       set_page_private(page, order);
> +                       list_add_tail(&page->lru, &isolated_pages);
> +                       isolated_cnt++;
> +                       if (isolated_cnt == page_hitning_conf->max_pages) {
> +                               page_hitning_conf->hint_pages(&isolated_pages);
> +                               release_buddy_pages(&isolated_pages);
> +                               isolated_cnt = 0;
> +                       }
> +               }
> +               start = set_bit + 1;
> +       }
> +       if (isolated_cnt) {
> +               page_hitning_conf->hint_pages(&isolated_pages);
> +               release_buddy_pages(&isolated_pages);
> +       }
> +}
> +

I really worry that this loop is going to become more expensive as the
size of memory increases. For example, if we hint on just 16 pages we
would have to walk something like 32K bits (512 longs) on a system with
64G of memory. Have you considered testing with a larger memory
footprint to see if it has an impact on performance?

> +static void init_hinting_wq(struct work_struct *work)
> +{
> +       int zone_idx, free_pages;
> +
> +       atomic_set(&page_hinting_active, 1);
> +       for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
> +               free_pages = atomic_read(&free_area[zone_idx].free_pages);
> +               if (free_pages >= page_hitning_conf->max_pages)
> +                       scan_zone_free_area(zone_idx, free_pages);
> +       }
> +       atomic_set(&page_hinting_active, 0);
> +}
> +
> +void page_hinting_enqueue(struct page *page, int order)
> +{
> +       int zone_idx;
> +
> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
> +               return;

I would think it is going to be expensive to be jumping into this
function for every freed page. You should probably have an inline that
takes care of the order check before you even get here, since it would
be faster that way.
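
Something like the following (an untested sketch; the wrapper name is
made up) in the header would let the common order-0 free path avoid the
function call entirely:

static inline void page_hinting_record_free(struct page *page, int order)
{
	/* Only orders large enough to be hinted take the real call. */
	if (order >= PAGE_HINTING_MIN_ORDER)
		page_hinting_enqueue(page, order);
}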

> +
> +       bm_set_pfn(page);
> +       if (atomic_read(&page_hinting_active))
> +               return;

So I would think this piece is racy. Specifically, if you set a bit for
a PFN that is somewhere below the PFN you are currently processing in
your scan, it is going to remain unprocessed until you have another
page freed after the scan is completed. I would worry you can end up
with a batch free of memory resulting in a group of pages sitting at
the start of your bitmap unhinted.

In my patches I resolved this by looping through all of the zones;
however, your approach is missing the necessary pieces to make that
safe, as you could end up in a soft lockup with the scanning thread
spinning on a noisy system.

> +       zone_idx = zone_idx(page_zone(page));
> +       if (atomic_read(&free_area[zone_idx].free_pages) >=
> +                       page_hitning_conf->max_pages) {
> +               int cpu = smp_processor_id();
> +
> +               queue_work_on(cpu, system_wq, &hinting_work);
> +       }
> +}


* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-10 19:51 [RFC][PATCH v11 0/2] mm: Support for page hinting Nitesh Narayan Lal
                   ` (3 preceding siblings ...)
  2019-07-10 20:19 ` [RFC][PATCH v11 0/2] mm: " Dave Hansen
@ 2019-07-10 23:40 ` Alexander Duyck
  2019-07-11 11:30   ` Nitesh Narayan Lal
  4 siblings, 1 reply; 43+ messages in thread
From: Alexander Duyck @ 2019-07-10 23:40 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:

The results up here were redundant with what is below so I am just
dropping them. I would suggest only including one set of results in
any future cover page as it is confusing to duplicate it like that.

> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
> A new hook after buddy merging is used to set the bits in the bitmap.
> Currently, the bits are only cleared when pages are hinted, not when pages are
> re-allocated.
>
> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
> threshold is met, trying to isolate and report pages that are still free.
>
> The isolated pages are reported via virtio-balloon, which is responsible for
> sending batched pages to the host synchronously. Once the hypervisor processed
> the hinting request, the isolated pages are returned back to the buddy.
>
> Changelog in v11:
> * Added logic to take care of multiple NUMA nodes scenarios.
> * Simplified the logic for reporting isolated pages to the host. (Eg. replaced
> dynamically allocated arrays with static ones, introduced wait event instead of
> the loop in order to wait for a response from the host)
> * Added a mutex to prevent race condition when page hinting is enabled by
> multiple drivers.
> * Simplified the logic responsible for decrementing free page counter for each
> zone.
> * Simplified code structuring/naming.
>
> Known work items for the future:
> * Test device assigned guests to ensure that hinting doesn't break it.
> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
> * Decide between MADV_DONTNEED and MADV_FREE.
> * Look into memory hotplug, more efficient locking, better naming conventions to
> avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
> * Come up with proper/traceable error-message/logs and look into other code
> simplifications. (If necessary).
>
> Benefit analysis:
> 1. Number of 5GB guests (each touching 4GB memory) that can be launched without
> swap usage on a system with 15GB:
> unmodified kernel - 2, 3rd with 2.5GB
> v11 page hinting - 6, 7th with 26MB
> v1 bubble hinting - 6, 7th with 1.8GB
>
> Conclusion - In this particular testcase, using v11 page hinting or
> v1 bubble hinting, 4 more guests could be launched without swapping compared
> to an unmodified kernel.
> For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
> hinting as it touches less swap space.

I'm confused by the comment. From what I can tell bubble hinting came
up with 1.8GB of memory while page hinting only managed to achieve
.026GB (Using the same units makes it easier to visualize the
difference). Also your test says "can be launched without swap usage",
yet you say the bubble hinting is touching swap, which makes no sense
to me.

> Setup & procedure -
> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
> Guest Memory = 5GB
> Number of CPUs in the guest = 1
> Host swap = 4GB
> Workload = test allocation program that allocates 4GB memory, touches it via
> memset and exits.
> The first guest is launched and, once its console is up, the test allocation
> program is executed with a 4 GB memory request (due to this the guest occupies
> almost 4-5 GB of memory in the host on a system without page hinting). Once
> this program exits, another guest is launched in the host and the same
> process is followed. This is continued until swap comes into use.
>
> 2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0
>
> For this particular test-case, in a guest which doesn't require swap access,
> "memhog 6G" execution time lies within a range of 15-30s.
> Conclusion -
> In the above test case, for an unmodified kernel, on executing memhog in the
> third guest the execution time rises to above 2 minutes due to swap access.
> Using either page hinting or bubble hinting brings this execution time back
> to the normal range of 15-30s.

So really this test doesn't add much value. The whole reason why
Guest3 runs so much slower is because it is going to swap. I initially
did this to demonstrate a point, but now running this test doesn't
prove much as it isn't really meant to be a performance test. It is
essentially just a duplicate of the "how many guests can you run" test
that is passing itself off as some sort of performance test.

We could probably just drop this from future versions of the series as
long as we verify that the memory hinting is freeing most of the memory
back and the guest is reporting a size less than the total guest
memory size.

> Setup & procedure -
> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
> Guest Memory = 6GB
> Number of CPUs in the guest = 4
> Process = 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
> one after the other in each of them.
> Host swap = 4GB
>
> Performance Analysis:
> 1. will-it-scale's page_fault1
> Setup -
> Guest Memory = 6GB
> Number of cores = 24
>
> Unmodified kernel -
> 0,0,100,0,100,0
> 1,514453,95.84,519502,95.83,519502
> 2,991485,91.67,932268,91.68,1039004
> 3,1381237,87.36,1264214,87.64,1558506
> 4,1789116,83.36,1597767,83.88,2078008
> 5,2181552,79.20,1889489,80.08,2597510
> 6,2452416,75.05,2001879,77.10,3117012
> 7,2671047,70.90,2263866,73.22,3636514
> 8,2930081,66.75,2333813,70.60,4156016
> 9,3126431,62.60,2370108,68.28,4675518
> 10,3211937,58.44,2454093,65.74,5195020
> 11,3162172,54.32,2450822,63.21,5714522
> 12,3154261,50.14,2272290,58.98,6234024
> 13,3115174,46.02,2369679,57.74,6753526
> 14,3150511,41.86,2470837,54.02,7273028
> 15,3134158,37.71,2428129,51.98,7792530
> 16,3143067,33.57,2340469,49.54,8312032
> 17,3112457,29.43,2263627,44.81,8831534
> 18,3089724,25.29,2181879,38.69,9351036
> 19,3076878,21.15,2236505,40.01,9870538
> 20,3091978,16.95,2266327,35.00,10390040
> 21,3082927,12.84,2172578,28.12,10909542
> 22,3055282,8.73,2176269,29.14,11429044
> 23,3081144,4.56,2138442,24.87,11948546
> 24,3075509,0.45,2173753,21.62,12468048
>
> page hinting -
> 0,0,100,0,100,0
> 1,491683,95.83,494366,95.82,494366
> 2,988415,91.67,919660,91.68,988732
> 3,1344829,87.52,1244608,87.69,1483098
> 4,1797933,83.37,1625797,83.70,1977464
> 5,2179009,79.21,1881534,80.13,2471830
> 6,2449858,75.07,2078137,76.82,2966196
> 7,2732122,70.90,2178105,73.75,3460562
> 8,2910965,66.75,2340901,70.28,3954928
> 9,3006665,62.61,2353748,67.91,4449294
> 10,3164752,58.46,2377936,65.08,4943660
> 11,3234846,54.32,2510149,63.14,5438026
> 12,3165477,50.17,2412007,59.91,5932392
> 13,3141457,46.05,2421548,57.85,6426758
> 14,3135839,41.90,2378021,53.81,6921124
> 15,3109113,37.75,2269290,51.76,7415490
> 16,3093613,33.62,2346185,48.73,7909856
> 17,3086542,29.49,2352140,46.19,8404222
> 18,3048991,25.36,2217144,41.52,8898588
> 19,2965500,21.18,2313614,38.18,9392954
> 20,2928977,17.05,2175316,35.67,9887320
> 21,2896667,12.91,2141311,28.90,10381686
> 22,3047782,8.76,2177664,28.24,10876052
> 23,2994503,4.58,2160976,22.97,11370418
> 24,3038762,0.47,2053533,22.39,11864784
>
> bubble-hinting v1 -
> 0,0,100,0,100,0
> 1,515272,95.83,492355,95.81,515272
> 2,985903,91.66,919653,91.68,1030544
> 3,1475300,87.51,1353723,87.65,1545816
> 4,1783938,83.36,1586307,83.78,2061088
> 5,2093307,79.20,1867395,79.95,2576360
> 6,2441370,75.05,2055421,76.65,3091632
> 7,2650471,70.89,2246014,72.93,3606904
> 8,2926782,66.75,2333601,70.41,4122176
> 9,3107617,62.60,2383112,68.46,4637448
> 10,3192332,58.44,2441626,65.84,5152720
> 11,3268043,54.32,2235964,62.92,5667992
> 12,3191105,50.18,2449045,60.49,6183264
> 13,3145317,46.05,2377317,57.80,6698536
> 14,3161552,41.91,2395814,53.26,7213808
> 15,3140443,37.77,2333200,51.42,7729080
> 16,3130866,33.65,2150967,46.11,8244352
> 17,3112894,29.52,2372068,45.93,8759624
> 18,3078424,25.39,2336211,39.85,9274896
> 19,3036457,21.27,2224821,35.25,9790168
> 20,3046330,17.13,2199755,37.43,10305440
> 21,2981130,12.98,2214862,28.67,10820712
> 22,3017481,8.84,2195996,29.69,11335984
> 23,2979906,4.68,2173395,25.90,11851256
> 24,2971170,0.52,2134311,21.89,12366528

Okay, so this doesn't match up with the results you gave me last time
(https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
and actually more closely matches what I was expecting to see. The
bubble-hinting patches are performing within a few percent of what the
baseline kernel was doing. I am assuming the results from before had
some additional debugging enabled for the bubble-hinting test that
wasn't enabled for the other ones.

> Conclusion -
> For an unmodified kernel, with every fresh boot, there is a 3-4% delta observed
> in the results wrt the numbers mentioned above. For both bubble-hinting and
> page-hinting, there was no noticeable degradation observed other than the
> expected variability mentioned earlier.
>
> Page hinting vs bubble hinting:
> From the benefits and performance perspective, both solutions look quite similar
> so far. However, unlike bubble hinting, which is more invasive, the overall core
> mm changes required for page hinting are minimal.
>
> [1] https://lkml.org/lkml/2019/6/19/926

I think I called this out in the review of the patch, but we may want
to see what happens if we increase the size of the memory in the guest
to something more like 64G or larger. My main concern is that as we
increase the size of memory the walk through the bitmap is going to
become more and more expensive, and I am worried that at some point it
will start impacting the results.


* Re: [QEMU Patch] virtio-baloon: Support for page hinting
  2019-07-10 19:53 ` [QEMU Patch] virtio-baloon: Support for page hinting Nitesh Narayan Lal
  2019-07-10 20:17   ` Alexander Duyck
@ 2019-07-11  8:49   ` Cornelia Huck
  2019-07-11 11:13     ` Nitesh Narayan Lal
  2019-07-11 18:55   ` Michael S. Tsirkin
  2 siblings, 1 reply; 43+ messages in thread
From: Cornelia Huck @ 2019-07-11  8:49 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko

On Wed, 10 Jul 2019 15:53:03 -0400
Nitesh Narayan Lal <nitesh@redhat.com> wrote:


$SUBJECT: s/baloon/balloon/

> Enables QEMU to perform madvise free on the memory range reported
> by the vm.

[No comments on the actual functionality; just some stuff I noticed.]

> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  hw/virtio/trace-events                        |  1 +
>  hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
>  include/hw/virtio/virtio-balloon.h            |  2 +-
>  include/qemu/osdep.h                          |  7 +++
>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>  5 files changed, 69 insertions(+), 1 deletion(-)
> 

(...)

> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 2112874055..5d186707b5 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -34,6 +34,9 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES	16
> +void free_mem_range(uint64_t addr, uint64_t len);
> +
>  struct PartiallyBalloonedPage {
>      RAMBlock *rb;
>      ram_addr_t base;
> @@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>      balloon_stats_change_timer(s, 0);
>  }
>  
> +void free_mem_range(uint64_t addr, uint64_t len)
> +{
> +    int ret = 0;
> +    void *hvaddr_to_free;
> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
> +                                                 addr, 1);
> +    if (!mrs.mr) {
> +	warn_report("%s:No memory is mapped at address 0x%lu", __func__, addr);

Indentation seems to be off here (also in other places; please double
check.)

> +        return;
> +    }
> +
> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
> +	warn_report("%s:Memory at address 0x%s is not RAM:0x%lu", __func__,
> +		    HWADDR_PRIx, addr);
> +        memory_region_unref(mrs.mr);
> +        return;
> +    }
> +
> +    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
> +    trace_virtio_balloon_hinting_request(addr, len);
> +    ret = qemu_madvise(hvaddr_to_free,len, QEMU_MADV_FREE);
> +    if (ret == -1) {
> +	warn_report("%s: Madvise failed with error:%d", __func__, ret);
> +    }
> +}
> +
> +static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
> +					       VirtQueue *vq)
> +{
> +    VirtQueueElement *elem;
> +    size_t offset = 0;
> +    uint64_t gpa, len;
> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +    if (!elem) {
> +        return;
> +    }
> +    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
> +     * only read the buffer which holds a valid PFN value.
> +     * TODO: Find a better way to do this.
> +     */
> +    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
> +	offset += 8;
> +	offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);
> +	if (!qemu_balloon_is_inhibited()) {
> +	    free_mem_range(gpa, len);
> +	}
> +    }
> +    virtqueue_push(vq, elem, offset);
> +    virtio_notify(vdev, vq);
> +    g_free(elem);
> +}
> +
>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> @@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>      f |= dev->host_features;
>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);

I don't think you can add this unconditionally if you want to keep this
migratable. This should be done via a property (as for deflate-on-oom
and free-page-hint) so it can be turned off in compat machines.
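
Something along the lines of this untested sketch, mirroring how
free-page-hint is wired up:

    DEFINE_PROP_BIT("guest-page-hinting", VirtIOBalloon, host_features,
                    VIRTIO_BALLOON_F_HINTING, true),

in the balloon's property list, so compat machines can then set it to
off.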

>  
>      return f;
>  }
> @@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);

This should probably be conditional in the same way as the free page hint
queue (also see above).
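
I.e., something like (sketch):

    if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_HINTING)) {
        s->hvq = virtio_add_queue(vdev, 128,
                                  virtio_balloon_handle_page_hinting);
    }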

>  
>      if (virtio_has_feature(s->host_features,
>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> @@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
>  
>      object_property_add(obj, "guest-stats", "guest statistics",
>                          balloon_stats_get_all, NULL, NULL, s, NULL);
> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
> +                        NULL, NULL, NULL, s, NULL);

This object does not have any accessors; what purpose does it serve?

>  
>      object_property_add(obj, "guest-stats-polling-interval", "int",
>                          balloon_stats_get_poll_interval,

(...)

> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> index 9375ca2a70..f9e3e82562 100644
> --- a/include/standard-headers/linux/virtio_balloon.h
> +++ b/include/standard-headers/linux/virtio_balloon.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12

Please split off any update to these headers into a separate patch, so
that it can be replaced by a proper headers update when it is merged.


* Re: [QEMU Patch] virtio-baloon: Support for page hinting
  2019-07-11  8:49   ` Cornelia Huck
@ 2019-07-11 11:13     ` Nitesh Narayan Lal
  0 siblings, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 11:13 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko


On 7/11/19 4:49 AM, Cornelia Huck wrote:
> On Wed, 10 Jul 2019 15:53:03 -0400
> Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> $SUBJECT: s/baloon/balloon/
>
>> Enables QEMU to perform madvise free on the memory range reported
>> by the vm.
> [No comments on the actual functionality; just some stuff I noticed.]
>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  hw/virtio/trace-events                        |  1 +
>>  hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
>>  include/hw/virtio/virtio-balloon.h            |  2 +-
>>  include/qemu/osdep.h                          |  7 +++
>>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>>  5 files changed, 69 insertions(+), 1 deletion(-)
>>
> (...)
>
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 2112874055..5d186707b5 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -34,6 +34,9 @@
>>  
>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>  
>> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES	16
>> +void free_mem_range(uint64_t addr, uint64_t len);
>> +
>>  struct PartiallyBalloonedPage {
>>      RAMBlock *rb;
>>      ram_addr_t base;
>> @@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>>      balloon_stats_change_timer(s, 0);
>>  }
>>  
>> +void free_mem_range(uint64_t addr, uint64_t len)
>> +{
>> +    int ret = 0;
>> +    void *hvaddr_to_free;
>> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
>> +                                                 addr, 1);
>> +    if (!mrs.mr) {
>> +	warn_report("%s:No memory is mapped at address 0x%lu", __func__, addr);
> Indentation seems to be off here (also in other places; please double
> check.)
Thanks, I will check it.
>
>> +        return;
>> +    }
>> +
>> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
>> +	warn_report("%s:Memory at address 0x%s is not RAM:0x%lu", __func__,
>> +		    HWADDR_PRIx, addr);
>> +        memory_region_unref(mrs.mr);
>> +        return;
>> +    }
>> +
>> +    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
>> +    trace_virtio_balloon_hinting_request(addr, len);
>> +    ret = qemu_madvise(hvaddr_to_free,len, QEMU_MADV_FREE);
>> +    if (ret == -1) {
>> +	warn_report("%s: Madvise failed with error:%d", __func__, ret);
>> +    }
>> +}
>> +
>> +static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
>> +					       VirtQueue *vq)
>> +{
>> +    VirtQueueElement *elem;
>> +    size_t offset = 0;
>> +    uint64_t gpa, len;
>> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>> +    if (!elem) {
>> +        return;
>> +    }
>> +    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
>> +     * only read the buffer which holds a valid PFN value.
>> +     * TODO: Find a better way to do this.
>> +     */
>> +    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
>> +	offset += 8;
>> +	offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);
>> +	if (!qemu_balloon_is_inhibited()) {
>> +	    free_mem_range(gpa, len);
>> +	}
>> +    }
>> +    virtqueue_push(vq, elem, offset);
>> +    virtio_notify(vdev, vq);
>> +    g_free(elem);
>> +}
>> +
>>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>  {
>>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
>> @@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>>      f |= dev->host_features;
>>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
>> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
> I don't think you can add this unconditionally if you want to keep this
> migratable. This should be done via a property (as for deflate-on-oom
> and free-page-hint) so it can be turned off in compat machines.
I see, I will take a look at it.
>
>>  
>>      return f;
>>  }
>> @@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);
> This should probably be conditional in the same way as the free page hint
> queue (also see above).
Makes sense. Thanks.
>
>>  
>>      if (virtio_has_feature(s->host_features,
>>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>> @@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
>>  
>>      object_property_add(obj, "guest-stats", "guest statistics",
>>                          balloon_stats_get_all, NULL, NULL, s, NULL);
>> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
>> +                        NULL, NULL, NULL, s, NULL);
> This object does not have any accessors; what purpose does it serve?
I think it's not required. I will correct this.
>
>>  
>>      object_property_add(obj, "guest-stats-polling-interval", "int",
>>                          balloon_stats_get_poll_interval,
> (...)
>
>> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
>> index 9375ca2a70..f9e3e82562 100644
>> --- a/include/standard-headers/linux/virtio_balloon.h
>> +++ b/include/standard-headers/linux/virtio_balloon.h
>> @@ -36,6 +36,7 @@
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
>> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>>  
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> Please split off any update to these headers into a separate patch, so
> that it can be replaced by a proper headers update when it is merged.
I will do that.
-- 
Thanks
Nitesh



* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-10 23:40 ` Alexander Duyck
@ 2019-07-11 11:30   ` Nitesh Narayan Lal
  2019-07-11 14:58     ` Alexander Duyck
  0 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 11:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/10/19 7:40 PM, Alexander Duyck wrote:
> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> The results up here were redundant with what is below so I am just
> dropping them. I would suggest only including one set of results in
> any future cover page as it is confusing to duplicate it like that.
>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor processed
>> the hinting request, the isolated pages are returned back to the buddy.
>>
>> Changelog in v11:
>> * Added logic to take care of multiple NUMA nodes scenarios.
>> * Simplified the logic for reporting isolated pages to the host. (Eg. replaced
>> dynamically allocated arrays with static ones, introduced wait event instead of
>> the loop in order to wait for a response from the host)
>> * Added a mutex to prevent race condition when page hinting is enabled by
>> multiple drivers.
>> * Simplified the logic responsible for decrementing free page counter for each
>> zone.
>> * Simplified code structuring/naming.
>>
>> Known work items for the future:
>> * Test device assigned guests to ensure that hinting doesn't break it.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, better naming conventions to
>> avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
>> * Come up with proper/traceable error-message/logs and look into other code
>> simplifications. (If necessary).
>>
>> Benefit analysis:
>> 1. Number of 5GB guests (each touching 4GB memory) that can be launched without
>> swap usage on a system with 15GB:
>> unmodified kernel - 2, 3rd with 2.5GB
>> v11 page hinting - 6, 7th with 26MB
>> v1 bubble hinting - 6, 7th with 1.8GB
>>
>> Conclusion - In this particular testcase, using v11 page hinting or
>> v1 bubble hinting, 4 more guests could be launched without swapping compared
>> to an unmodified kernel.
>> For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
>> hinting as it touches less swap space.
> I'm confused by the comment. From what I can tell bubble hinting came
> up with 1.8GB of memory while page hinting only managed to achieve
> .026GB (Using the same units makes it easier to visualize the
> difference). Also your test says "can be launched without swap usage",
> yet you say the bubble hinting is touching swap, which makes no sense
> to me.
I will work on the cover letter to improve this part.
Basically, in each case the first number indicates the number of guests
which are launched without touching swap space. For instance, with
bubble hinting I was able to launch 6 guests without any swap usage. On
launching the 7th guest there was initially no swap usage; however, as
the test app started allocating 4GB of memory, swap came into the
picture. 1.8GB is the swap usage after the completion of the test
application.
>> Setup & procedure -
>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>> Guest Memory = 5GB
>> Number of CPUs in the guest = 1
>> Host swap = 4GB
>> Workload = test allocation program that allocates 4GB memory, touches it via
>> memset and exits.
>> The first guest is launched and, once its console is up, the test allocation
>> program is executed with a 4 GB memory request (due to this the guest occupies
>> almost 4-5 GB of memory in the host on a system without page hinting). Once
>> this program exits, another guest is launched in the host and the same
>> process is followed. This is continued until swap comes into use.
>>
>> 2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
>> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
>> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
>> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0
>>
>> For this particular test-case, in a guest which doesn't require swap access,
>> "memhog 6G" execution time lies within a range of 15-30s.
>> Conclusion -
>> In the above test case, for an unmodified kernel, on executing memhog in the
>> third guest the execution time rises to above 2 minutes due to swap access.
>> Using either page hinting or bubble hinting brings this execution time back
>> to the normal range of 15-30s.
> So really this test doesn't add much value. The whole reason why
> Guest3 runs so much slower is because it is going to swap. I initially
> did this to demonstrate a point, but now running this test doesn't
> prove much as it isn't really meant to be a performance test. It is
> essentially just a duplicate of the "how many guests can you run" test
> that is passing itself off as some sort of performance test.
>
> We could probably just drop this from future versions of the series as
> long as we verify that the memory hinting is freeing most of the memory
> back and the guest is reporting a size less than the total guest
> memory size.
>
+1, makes sense to keep just one of the above two.
>> Setup & procedure -
>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>> Guest Memory = 6GB
>> Number of CPUs in the guest = 4
>> Process = 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>> one after the other in each of them.
>> Host swap = 4GB
>>
>> Performance Analysis:
>> 1. will-it-scale's page_fault1
>> Setup -
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> (...) [will-it-scale data tables snipped; identical to those quoted above]
> Okay, so this doesn't match up with the results you gave me last time
> (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
> and actually more closely matches what I was expecting to see. The
> bubble-hinting patches are performing within a few percent of what the
> baseline kernel was doing. 
Interestingly, even with an unmodified kernel, with every fresh boot I
observed a certain amount of variability in the results, as I stated
below.
> I am assuming the results from before had
> some additional debugging enabled for the bubble-hinting test that
> wasn't enabled for the other ones.

Nope, I had debugging options enabled for all the cases. This time
around I disabled all the debug options.

>
>> Conclusion -
>> For an unmodified kernel, with every fresh boot, there is a 3-4% delta observed
>> in the results wrt the numbers mentioned above. For both bubble-hinting and
>> page-hinting, there was no noticeable degradation observed other than the
>> expected variability mentioned earlier.
>>
>> Page hinting vs bubble hinting:
>> From the benefits and performance perspective, both solutions look quite similar
>> so far. However, unlike bubble hinting, which is more invasive, the overall core
>> mm changes required for page hinting are minimal.
>>
>> [1] https://lkml.org/lkml/2019/6/19/926
> I think I called this out in the review of the patch, but we may want
> to see what happens if we increase the size of the memory in the guest
> to something more like 64G or larger. My main concern is that as we
> increase the size of memory the walk through the bitmap is going to
> become more and more expensive, and I am worried that at some point it
> will start impacting the results.
Ok, I can try that scenario.



-- 
Thanks
Nitesh



* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-10 20:19 ` [RFC][PATCH v11 0/2] mm: " Dave Hansen
@ 2019-07-11 11:37   ` Nitesh Narayan Lal
  0 siblings, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 11:37 UTC (permalink / raw)
  To: Dave Hansen, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	mhocko


On 7/10/19 4:19 PM, Dave Hansen wrote:
> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for reporting free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
>>
>> Measurement results (measurement details appended to this email):
>> *Number of 5GB guests (each touching 4GB memory) that can be launched
>> without swap usage on a system with 15GB:
> This sounds like a reasonable measurement, but I think you're missing a
> sentence or two explaining why this test was used.
I will re-work the cover email to better communicate the numbers.
>
>> unmodified kernel - 2, 3rd with 2.5GB   
> What does "3rd with 2.5GB" mean?  The third gets 2.5GB before failing an
> allocation and crashing?
It doesn't crash or fail. To complete the execution of the test
application (which allocates 4GB of memory) in the 3rd guest, 2.5GB of
swap was accessed.
>
>> v11 page hinting - 6, 7th with 26MB    
>> v1 bubble hinting[1] - 6, 7th with 1.8GB (bubble hinting is another series
>> proposed to solve the same problems)
> Could you please make an effort to format things so that reviewers can
> easily read them?  Aligning columns and using common units would be very
> helpful, for instance:
>
>      unmodified kernel - 2, 3rd with 2.50 GB
>       v11 page hinting - 6, 7th with 0.03 GB
>   v1 bubble hinting[1] - 6, 7th with 1.80 GB
>
> See how you can scan that easily and compare between the rows?
>
> I think you did some analysis below.  But, that seems misplaced.  It's
> better to include the conclusion here and the details to back it up
> later.  As it stands, the cover letter just throws some data at a
> reviewer and hopes they can make sense of it.
I will improve this. Thanks.
>
>> *Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
>> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
>> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
>> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           
> Again, I'm finding myself having to reformat your data just so I can
> make sense of it.  You also forgot the unit for Guest 1 in row 3.
>
>    unmodified - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
>
>   v11 hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
>   v1 bubble   - Guest1:23s, Guest2:11s, Guest3:26s swap used = 0
>
> So, what is this supposed to show?  What does it mean?  Why do the
> numbers vary *so* much?

Basically, the idea was to communicate that with hinting the swap is
not accessed and hence the execution time is lower.

But as you already mentioned, next time around I will format this and
add the conclusion along with these numbers.
I agree with Alexander's comment that there is no point in having the
same thing in two places.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 20:45   ` Dave Hansen
@ 2019-07-11 11:48     ` Nitesh Narayan Lal
  2019-07-11 15:25     ` Nitesh Narayan Lal
  2019-07-15  9:26     ` David Hildenbrand
  2 siblings, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 11:48 UTC (permalink / raw)
  To: Dave Hansen, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	mhocko


On 7/10/19 4:45 PM, Dave Hansen wrote:
> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>> +struct zone_free_area {
>> +	unsigned long *bitmap;
>> +	unsigned long base_pfn;
>> +	unsigned long end_pfn;
>> +	atomic_t free_pages;
>> +	unsigned long nbits;
>> +} free_area[MAX_NR_ZONES];
> Why do we need an extra data structure?  What's wrong with putting
> per-zone data in ... 'struct zone'?  The cover letter claims that it
> doesn't touch core-mm infrastructure, but if it depends on mechanisms
> like this, I think that's a very bad thing.
>
> To be honest, I'm not sure this series is worth reviewing at this point.
>  It's horribly lightly commented and full of kernel antipatterns like
>
> void func()
> {
> 	if () {
> 		... indent entire logic
> 		... of function
> 	}
> }
>
> It has big "TODO"s.  It's virtually comment-free.  I'm shocked it's at
> the 11th version and still looking like this.
One of the reasons for being on v11 was that the entire design has
changed a few times.
But that's no excuse; I understand what you are saying, and I will
work on it and improve this.
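
For reference, the shape being asked for is the usual kernel
early-return style (see Documentation/process/coding-style.rst); an
illustrative sketch:

    void func(void)
    {
        if (!condition)
            return;     /* bail out early */

        /* ... main logic stays at a single indent level ... */
    }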
>
>> +
>> +		for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
>> +			unsigned long pages = free_area[zone_idx].end_pfn -
>> +					free_area[zone_idx].base_pfn;
>> +			bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1;
>> +			if (!bitmap_size)
>> +				continue;
>> +			free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size,
>> +								   GFP_KERNEL);
> This doesn't support sparse zones.  We can have zones with massive
> spanned page sizes, but very few present pages.  On those zones, this
> will exhaust memory for no good reason.
Thanks, I will look into this.
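
A sketch of the direction Dave is pointing at (field names are made
up): the per-zone state could live in struct zone itself, where
base_pfn/end_pfn become redundant with the zone's existing
zone_start_pfn and spanned_pages fields:

    /* Hypothetical fields inside struct zone: */
    #ifdef CONFIG_PAGE_HINTING
        unsigned long   *hint_bitmap;    /* 1 bit per MIN_ORDER block */
        unsigned long   hint_nbits;      /* valid bits in hint_bitmap */
        atomic_t        hint_free_pages; /* hintable free pages       */
    #endif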
>
> Comparing this to Alex's patch set, it's of much lower quality and at a
> much earlier stage of development.  The two sets are not really even
> comparable right now.  This certainly doesn't sell me on (or even really
> enumerate the deltas in) this approach vs. Alex's.
>
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [QEMU Patch] virtio-balloon: Support for page hinting
  2019-07-10 20:17   ` Alexander Duyck
@ 2019-07-11 12:03     ` Nitesh Narayan Lal
  0 siblings, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 12:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/10/19 4:17 PM, Alexander Duyck wrote:
> On Wed, Jul 10, 2019 at 12:53 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> Enables QEMU to perform madvise MADV_FREE on the memory range reported
>> by the VM.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  hw/virtio/trace-events                        |  1 +
>>  hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
>>  include/hw/virtio/virtio-balloon.h            |  2 +-
>>  include/qemu/osdep.h                          |  7 +++
>>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>>  5 files changed, 69 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>> index e28ba48da6..f703a22d36 100644
>> --- a/hw/virtio/trace-events
>> +++ b/hw/virtio/trace-events
>> @@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
>> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request PFN:%lu size: %d"
>>
>>  # virtio-mmio.c
>>  virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 2112874055..5d186707b5 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -34,6 +34,9 @@
>>
>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>
>> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES  16
>> +void free_mem_range(uint64_t addr, uint64_t len);
>> +
> The definition you have here is unused. I think you can drop it. Also
> why do you need this forward declaration? Couldn't you just leave
> free_mem_range below as a static and still have this compile?
+1. Thanks for pointing this out.
>
>>  struct PartiallyBalloonedPage {
>>      RAMBlock *rb;
>>      ram_addr_t base;
>> @@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>>      balloon_stats_change_timer(s, 0);
>>  }
>>
>> +void free_mem_range(uint64_t addr, uint64_t len)
>> +{
>> +    int ret = 0;
>> +    void *hvaddr_to_free;
>> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
>> +                                                 addr, 1);
>> +    if (!mrs.mr) {
>> +       warn_report("%s:No memory is mapped at address 0x%lu", __func__, addr);
>> +        return;
>> +    }
>> +
>> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
>> +       warn_report("%s:Memory at address 0x%s is not RAM:0x%lu", __func__,
>> +                   HWADDR_PRIx, addr);
>> +        memory_region_unref(mrs.mr);
>> +        return;
>> +    }
>> +
>> +    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
>> +    trace_virtio_balloon_hinting_request(addr, len);
>> +    ret = qemu_madvise(hvaddr_to_free, len, QEMU_MADV_FREE);
>> +    if (ret == -1) {
>> +       warn_report("%s: Madvise failed with error:%d", __func__, ret);
>> +    }
>> +}
>> +
>> +static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
>> +                                              VirtQueue *vq)
>> +{
>> +    VirtQueueElement *elem;
>> +    size_t offset = 0;
>> +    uint64_t gpa, len;
>> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>> +    if (!elem) {
>> +        return;
>> +    }
>> +    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
>> +     * only read the buffer which holds a valid PFN value.
>> +     * TODO: Find a better way to do this.
>> +     */
> I'm not sure this comment makes much sense to me. Shouldn't the
> iov_to_buf be limiting you anyway? Why do you need the additional gpa
> check?
>
>> +    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
>> +       offset += 8;
>> +       offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);
> Why pull this out as two separate buffers? Why not just define a
> structure that consists of the two uint64_t values and then pull the
> entire thing as one buffer? 
This does make sense. I will correct this. Thanks.
> I'm pretty sure the solution as you have
> it now opens you up to an error since you could have a malicious guest
> only give you a part of the structure and you really should be
> verifying you get the entire structure.
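
A sketch of the combined read being suggested (the struct name is
illustrative; a real version would also need to handle virtio
endianness, e.g. via virtio_tswap64(), and validate len):

    struct hint_desc {
        uint64_t gpa;
        uint64_t len;
    } hint;

    while (iov_to_buf(elem->out_sg, elem->out_num, offset,
                      &hint, sizeof(hint)) == sizeof(hint)) {
        offset += sizeof(hint);
        if (!qemu_balloon_is_inhibited()) {
            free_mem_range(hint.gpa, hint.len);
        }
    }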
>
>> +       if (!qemu_balloon_is_inhibited()) {
>> +           free_mem_range(gpa, len);
>> +       }
>> +    }
>> +    virtqueue_push(vq, elem, offset);
>> +    virtio_notify(vdev, vq);
>> +    g_free(elem);
>> +}
>> +
>>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>  {
>>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
>> @@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>>      f |= dev->host_features;
>>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
>> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
>>
>>      return f;
>>  }
>> @@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);
>>
>>      if (virtio_has_feature(s->host_features,
>>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>> @@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
>>
>>      object_property_add(obj, "guest-stats", "guest statistics",
>>                          balloon_stats_get_all, NULL, NULL, s, NULL);
>> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
>> +                        NULL, NULL, NULL, s, NULL);
>>
>>      object_property_add(obj, "guest-stats-polling-interval", "int",
>>                          balloon_stats_get_poll_interval,
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index 1afafb12f6..a58b24fdf2 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {
>>
>>  typedef struct VirtIOBalloon {
>>      VirtIODevice parent_obj;
>> -    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
>> +    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *hvq;
>>      uint32_t free_page_report_status;
>>      uint32_t num_pages;
>>      uint32_t actual;
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index af2b91f0b8..bb9207e7f4 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #else
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>>  #endif
>> +#ifdef MADV_FREE
>> +#define QEMU_MADV_FREE MADV_FREE
>> +#else
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>> +#endif
> As I mentioned before it might make more sense to use MADV_DONTNEED
> instead of just disabling this functionality if the host kernel
> doesn't have MADV_FREE support.
I tried to find the reason for it and eventually decided to just avoid
hinting and print an error message instead.
> That way you would still have the
> functionality on kernels prior to 4.5 if they need it.
I didn't think of this earlier. If that's the case, it does make
sense to fall back to MADV_DONTNEED.
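
A minimal sketch of that fallback inside free_mem_range(), assuming
QEMU_MADV_DONTNEED (which QEMU already defines) is an acceptable
substitute when MADV_FREE is unavailable:

    ret = qemu_madvise(hvaddr_to_free, len, QEMU_MADV_FREE);
    if (ret == -1) {
        /* Pre-4.5 host kernels lack MADV_FREE; there QEMU_MADV_FREE
         * resolves to QEMU_MADV_INVALID and the call fails, so retry
         * with the stronger MADV_DONTNEED. */
        ret = qemu_madvise(hvaddr_to_free, len, QEMU_MADV_DONTNEED);
    }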
>
>>  #elif defined(CONFIG_POSIX_MADVISE)
>>
>> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> Same here. It might make more sense to use the POSIX_MADV_DONTNEED
> instead of just making it invalid.
>
>>  #else /* no-op */
>>
>> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>
>>  #endif
>>
>> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
>> index 9375ca2a70..f9e3e82562 100644
>> --- a/include/standard-headers/linux/virtio_balloon.h
>> +++ b/include/standard-headers/linux/virtio_balloon.h
>> @@ -36,6 +36,7 @@
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM        2 /* Deflate balloon on OOM */
>>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT        3 /* VQ to report free pages */
>>  #define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
>> +#define VIRTIO_BALLOON_F_HINTING       5 /* Page hinting virtqueue */
>>
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>> --
>> 2.21.0
>>
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-11 11:30   ` Nitesh Narayan Lal
@ 2019-07-11 14:58     ` Alexander Duyck
  2019-07-11 15:03       ` Nitesh Narayan Lal
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Duyck @ 2019-07-11 14:58 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Thu, Jul 11, 2019 at 4:31 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 7/10/19 7:40 PM, Alexander Duyck wrote:
> > On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >
> > [...]
> > Okay, so this doesn't match up with the results you gave me last time
> > (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
> > and actually more closely matches what I was expecting to see. The
> > bubble-hinting patches are performing within a few percent of what the
> > baseline kernel was doing.
> Interestingly, even with an unmodified kernel, I observed a certain
> amount of variability in the results with every fresh boot, as I stated
> below.
> > I am assuming the results from before had
> > some additional debugging enabled for the bubble-hinting test that
> > wasn't enabled for the other ones.
>
> Nope, I had debugging options enabled for all the cases. This time
> around I disabled all the debug options.

We can agree to disagree, I guess. Those debugging options reduced
the throughput by over 30% on the guest kernel in my test runs. I was
never able to reproduce the data you reported, as enabling the same
debug features on an unmodified kernel reduced the throughput for the
test just the same as it did for the bubble-hinting version. Were you
running the debug options on the host kernel or the guest? I suppose
it is possible that having those debug options enabled on the host
might trigger behavior similar to what you reported: since you were
using MADV_FREE rather than MADV_DONTNEED, you wouldn't have to
reallocate the pages and could circumvent the page allocation
debugging.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-11 14:58     ` Alexander Duyck
@ 2019-07-11 15:03       ` Nitesh Narayan Lal
  2019-07-11 15:08         ` Alexander Duyck
  0 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 15:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/11/19 10:58 AM, Alexander Duyck wrote:
> On Thu, Jul 11, 2019 at 4:31 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 7/10/19 7:40 PM, Alexander Duyck wrote:
>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>
>>> [...]
>>> Okay, so this doesn't match up with the results you gave me last time
>>> (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
>>> and actually more closely matches what I was expecting to see. The
>>> bubble-hinting patches are performing within a few percent of what the
>>> baseline kernel was doing.
>> Interestingly, even with an unmodified kernel, I observed a certain
>> amount of variability in the results with every fresh boot, as I stated
>> below.
>>> I am assuming the results from before had
>>> some additional debugging enabled for the bubble-hinting test that
>>> wasn't enabled for the other ones.
>> Nope, I had debugging options enabled for all the cases. This time
>> around I disabled all the debug options.
> We can agree to disagree, I guess. Those debugging options reduced
> the throughput by over 30% on the guest kernel in my test runs. I was
> never able to reproduce the data you reported, as enabling the same
> debug features on an unmodified kernel reduced the throughput for the
> test just the same as it did for the bubble-hinting version. Were you
> running the debug options on the host kernel or the guest?
In the guest. Do the results I shared without debug options match
what you have?
I am also curious whether you see any variability in the page_fault1
results for an unmodified kernel across fresh boots. If so, how often?
-- 
Thanks
Nitesh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-11 15:03       ` Nitesh Narayan Lal
@ 2019-07-11 15:08         ` Alexander Duyck
  2019-07-11 15:19           ` Nitesh Narayan Lal
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Duyck @ 2019-07-11 15:08 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Thu, Jul 11, 2019 at 8:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 7/11/19 10:58 AM, Alexander Duyck wrote:
> > On Thu, Jul 11, 2019 at 4:31 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 7/10/19 7:40 PM, Alexander Duyck wrote:
> >>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>
> >>> [...]
> >>> Okay, so this doesn't match up with the results you gave me last time
> >>> (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
> >>> and actually more closely matches what I was expecting to see. The
> >>> bubble-hinting patches are performing within a few percent of what the
> >>> baseline kernel was doing.
> >> Interestingly, even with an unmodified kernel, I observed a certain
> >> amount of variability in the results with every fresh boot, as I stated
> >> below.
> >>> I am assuming the results from before had
> >>> some additional debugging enabled for the bubble-hinting test that
> >>> wasn't enabled for the other ones.
> >> Nope, I had debugging options enabled for all the cases. This time
> >> around I disabled all the debug options.
> > We can agree to disagree, I guess. Those debugging options reduced
> > the throughput by over 30% on the guest kernel in my test runs. I was
> > never able to reproduce the data you reported, as enabling the same
> > debug features on an unmodified kernel reduced the throughput for the
> > test just the same as it did for the bubble-hinting version. Were you
> > running the debug options on the host kernel or the guest?
> In the guest. Do the results I shared without debug options match
> what you have?
> I am also curious whether you see any variability in the page_fault1
> results for an unmodified kernel across fresh boots. If so, how often?

I see some variability, but not much. Usually it can vary by +/- 5% or
so. What I have been doing is collecting multiple runs, working out
the average, and then comparing that against an average with the
patches applied.

One other thing you can probably do to limit the variability would be
to look at disabling any power management features on the system. One
thing you could be seeing is the effect of the CPU enabling turbo mode
or going into sleep states if idle. That can easily throw the numbers
around quite a bit.
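
For example (illustrative; assumes an Intel host using the
intel_pstate driver):

    cpupower frequency-set -g performance
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo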

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-11 15:08         ` Alexander Duyck
@ 2019-07-11 15:19           ` Nitesh Narayan Lal
  2019-07-11 17:01             ` Alexander Duyck
  0 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 15:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/11/19 11:08 AM, Alexander Duyck wrote:
> On Thu, Jul 11, 2019 at 8:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 7/11/19 10:58 AM, Alexander Duyck wrote:
>>> On Thu, Jul 11, 2019 at 4:31 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> On 7/10/19 7:40 PM, Alexander Duyck wrote:
>>>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>>>
>>>>> The results up here were redundant with what is below so I am just
>>>>> dropping them. I would suggest only including one set of results in
>>>>> any future cover page as it is confusing to duplicate it like that.
>>>>>
> >>>>>> [...]
> >>>>>>
>>>>>> Benefit analysis:
>>>>>> 1. Number of 5GB guests (each touching 4GB memory) that can be launched without
>>>>>> swap usage on a system with 15GB:
>>>>>> unmodified kernel - 2, 3rd with 2.5GB
>>>>>> v11 page hinting - 6, 7th with 26MB
>>>>>> v1 bubble hinting - 6, 7th with 1.8GB
>>>>>>
>>>>>> Conclusion - In this particular testcase on using v11 page hinting and
>>>>>> v1 bubble-hinting 4 more guests could be launched without swapping compared
>>>>>> to an unmodified kernel.
>>>>>> For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
>>>>>> hinting as it touches less swap space.
>>>>> I'm confused by the comment. From what I can tell bubble hinting came
>>>>> up with 1.8GB of memory while page hinting only managed to achieve
>>>>> .026GB (Using the same units makes it easier to visualize the
>>>>> difference). Also your test says "can be launched without swap usage",
>>>>> yet you say the bubble hinting is touching swap, which makes no sense
>>>>> to me.
>>>> I will work on the cover to improve this part.
>>>> Basically, in each case, the first number indicates the number of
>>>> guests launched without touching the swap space. For instance,
>>>> with bubble hinting I was able to launch 6 guests without any swap
>>>> usage. On launching the 7th guest there was initially no swap usage;
>>>> however, as the test app started allocating 4GB of memory, swap came
>>>> into the picture. 1.8GB is the swap usage after the completion of the
>>>> test application.
>>>>>> Setup & procedure -
>>>>>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>>>>>> Guest Memory = 5GB
>>>>>> Number of CPUs in the guest = 1
>>>>>> Host swap = 4GB
>>>>>> Workload = test allocation program that allocates 4GB memory, touches it via
>>>>>> memset and exits.
>>>>>> The first guest is launched and once its console is up, the test allocation
>>>>>> program is executed with 4 GB memory request (Due to this the guest occupies
>>>>>> almost 4-5 GB of memory in the host in a system without page hinting). Once
>>>>>> this program exits, another guest is launched in the host and the
>>>>>> same process is followed. This is repeated until swap starts being used.
>>>>>>
>>>>>> 2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
>>>>>> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
>>>>>> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
>>>>>> v1 bubble hinting - Guest1:23s, Guest2:11s, Guest3:26s swap used = 0
>>>>>>
>>>>>> For this particular test-case in a guest which doesn't require swap access
>>>>>> "memhog 6G" execution time lies within a range of 15-30s.
>>>>>> Conclusion -
>>>>>> In the above test case for an unmodified kernel, on executing memhog in the
>>>>>> third guest the execution time rises to above 2 minutes due to swap access.
>>>>>> Using either page hinting or bubble hinting brings this execution time back
>>>>>> to a normal range of 15-30s.
>>>>> So really this test doesn't add much value. The whole reason why
>>>>> Guest3 runs so much slower is because it is going to swap. I initially
>>>>> did this to demonstrate a point, but now running this test doesn't
>>>>> prove much as it isn't really meant to be a performance test. It is
>>>>> essentially just a duplicate of the "how many guests can you run" test
>>>>> that is passing itself off as some sort of performance test.
>>>>>
>>>>> We could probably just drop this from future versions of this as long
>>>>> as we verify that the memory hinting is freeing most of the memory
>>>>> back and the guest is reporting a size less than the total guest
>>>>> memory size.
>>>>>
>>>> +1, makes sense to keep just one of the above two.
>>>>>> Setup & procedure -
>>>>>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>>>>>> Guest Memory = 6GB
>>>>>> Number of CPUs in the guest = 4
>>>>>> Process = 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>>>>>> one after the other in each of them.
>>>>>> Host swap = 4GB
>>>>>>
>>>>>> Performance Analysis:
>>>>>> 1. will-it-scale's page_fault1
>>>>>> Setup -
>>>>>> Guest Memory = 6GB
>>>>>> Number of cores = 24
>>>>>>
>>>>>> Unmodified kernel -
>>>>>> 0,0,100,0,100,0
>>>>>> 1,514453,95.84,519502,95.83,519502
>>>>>> 2,991485,91.67,932268,91.68,1039004
>>>>>> 3,1381237,87.36,1264214,87.64,1558506
>>>>>> 4,1789116,83.36,1597767,83.88,2078008
>>>>>> 5,2181552,79.20,1889489,80.08,2597510
>>>>>> 6,2452416,75.05,2001879,77.10,3117012
>>>>>> 7,2671047,70.90,2263866,73.22,3636514
>>>>>> 8,2930081,66.75,2333813,70.60,4156016
>>>>>> 9,3126431,62.60,2370108,68.28,4675518
>>>>>> 10,3211937,58.44,2454093,65.74,5195020
>>>>>> 11,3162172,54.32,2450822,63.21,5714522
>>>>>> 12,3154261,50.14,2272290,58.98,6234024
>>>>>> 13,3115174,46.02,2369679,57.74,6753526
>>>>>> 14,3150511,41.86,2470837,54.02,7273028
>>>>>> 15,3134158,37.71,2428129,51.98,7792530
>>>>>> 16,3143067,33.57,2340469,49.54,8312032
>>>>>> 17,3112457,29.43,2263627,44.81,8831534
>>>>>> 18,3089724,25.29,2181879,38.69,9351036
>>>>>> 19,3076878,21.15,2236505,40.01,9870538
>>>>>> 20,3091978,16.95,2266327,35.00,10390040
>>>>>> 21,3082927,12.84,2172578,28.12,10909542
>>>>>> 22,3055282,8.73,2176269,29.14,11429044
>>>>>> 23,3081144,4.56,2138442,24.87,11948546
>>>>>> 24,3075509,0.45,2173753,21.62,12468048
>>>>>>
>>>>>> page hinting -
>>>>>> 0,0,100,0,100,0
>>>>>> 1,491683,95.83,494366,95.82,494366
>>>>>> 2,988415,91.67,919660,91.68,988732
>>>>>> 3,1344829,87.52,1244608,87.69,1483098
>>>>>> 4,1797933,83.37,1625797,83.70,1977464
>>>>>> 5,2179009,79.21,1881534,80.13,2471830
>>>>>> 6,2449858,75.07,2078137,76.82,2966196
>>>>>> 7,2732122,70.90,2178105,73.75,3460562
>>>>>> 8,2910965,66.75,2340901,70.28,3954928
>>>>>> 9,3006665,62.61,2353748,67.91,4449294
>>>>>> 10,3164752,58.46,2377936,65.08,4943660
>>>>>> 11,3234846,54.32,2510149,63.14,5438026
>>>>>> 12,3165477,50.17,2412007,59.91,5932392
>>>>>> 13,3141457,46.05,2421548,57.85,6426758
>>>>>> 14,3135839,41.90,2378021,53.81,6921124
>>>>>> 15,3109113,37.75,2269290,51.76,7415490
>>>>>> 16,3093613,33.62,2346185,48.73,7909856
>>>>>> 17,3086542,29.49,2352140,46.19,8404222
>>>>>> 18,3048991,25.36,2217144,41.52,8898588
>>>>>> 19,2965500,21.18,2313614,38.18,9392954
>>>>>> 20,2928977,17.05,2175316,35.67,9887320
>>>>>> 21,2896667,12.91,2141311,28.90,10381686
>>>>>> 22,3047782,8.76,2177664,28.24,10876052
>>>>>> 23,2994503,4.58,2160976,22.97,11370418
>>>>>> 24,3038762,0.47,2053533,22.39,11864784
>>>>>>
>>>>>> bubble-hinting v1 -
>>>>>> 0,0,100,0,100,0
>>>>>> 1,515272,95.83,492355,95.81,515272
>>>>>> 2,985903,91.66,919653,91.68,1030544
>>>>>> 3,1475300,87.51,1353723,87.65,1545816
>>>>>> 4,1783938,83.36,1586307,83.78,2061088
>>>>>> 5,2093307,79.20,1867395,79.95,2576360
>>>>>> 6,2441370,75.05,2055421,76.65,3091632
>>>>>> 7,2650471,70.89,2246014,72.93,3606904
>>>>>> 8,2926782,66.75,2333601,70.41,4122176
>>>>>> 9,3107617,62.60,2383112,68.46,4637448
>>>>>> 10,3192332,58.44,2441626,65.84,5152720
>>>>>> 11,3268043,54.32,2235964,62.92,5667992
>>>>>> 12,3191105,50.18,2449045,60.49,6183264
>>>>>> 13,3145317,46.05,2377317,57.80,6698536
>>>>>> 14,3161552,41.91,2395814,53.26,7213808
>>>>>> 15,3140443,37.77,2333200,51.42,7729080
>>>>>> 16,3130866,33.65,2150967,46.11,8244352
>>>>>> 17,3112894,29.52,2372068,45.93,8759624
>>>>>> 18,3078424,25.39,2336211,39.85,9274896
>>>>>> 19,3036457,21.27,2224821,35.25,9790168
>>>>>> 20,3046330,17.13,2199755,37.43,10305440
>>>>>> 21,2981130,12.98,2214862,28.67,10820712
>>>>>> 22,3017481,8.84,2195996,29.69,11335984
>>>>>> 23,2979906,4.68,2173395,25.90,11851256
>>>>>> 24,2971170,0.52,2134311,21.89,12366528
>>>>> Okay, so this doesn't match up with the results you gave me last time
>>>>> (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
>>>>> and actually more closely matches what I was expecting to see. The
>>>>> bubble-hinting patches are performing within a few percent of what the
>>>>> baseline kernel was doing.
>>>> Interestingly, even with an unmodified kernel I observed a certain
>>>> amount of variability in the results with every fresh boot, which I
>>>> mention below.
>>>>> I am assuming the results from before had
>>>>> some additional debugging enabled for the bubble-hinting test that
>>>>> wasn't enabled for the other ones.
>>>> Nope, I had debugging options enabled for all the cases. This time
>>>> around I disabled all the debug options.
>>> We can agree to disagree I guess. Those debugging options had reduced
>>> the throughput by over 30% on the guest kernel in my test runs. I was
>>> never able to reproduce the data you reported as enabling the same
>>> debug features on an unmodified kernel had reduced the throughput for
>>> the test just the same as it did for the bubble hinting version. Were
>>> you running the debug options on the host kernel or the guest?
>> In the guest. Do the results which I shared without debug options match
>> with what you have?
>> I am also curious to know if you see any variability in the results of
>> page_fault1 for an unmodified kernel with every fresh boot. If so, how
>> often?
> I see some variability, but not much. Usually it can vary by +/- 5% or
> so.
+1
>  What I have been doing is collecting multiple runs, working out
> the average, and then comparing that against an average with the
> patches applied.
Yeah, I didn't share the average values, but I do the same.
I just wanted to mention the variability so that there is no confusion
if a later value comes out to be in the range of +/- 3-4%.
>
> One other thing you can probably do to limit the variability would be
> to look at disabling any power management features on the system. One
> thing you could be seeing is the effect of the CPU enabling turbo mode
> or going into sleep states if idle. That can easily throw the numbers
> around quite a bit.
The variability you mentioned, was it after disabling these options?
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 20:45   ` Dave Hansen
  2019-07-11 11:48     ` Nitesh Narayan Lal
@ 2019-07-11 15:25     ` Nitesh Narayan Lal
  2019-07-11 15:50       ` Nitesh Narayan Lal
  2019-07-11 16:22       ` Dave Hansen
  2019-07-15  9:26     ` David Hildenbrand
  2 siblings, 2 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 15:25 UTC (permalink / raw)
  To: Dave Hansen, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	mhocko


On 7/10/19 4:45 PM, Dave Hansen wrote:
> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>> +struct zone_free_area {
>> +	unsigned long *bitmap;
>> +	unsigned long base_pfn;
>> +	unsigned long end_pfn;
>> +	atomic_t free_pages;
>> +	unsigned long nbits;
>> +} free_area[MAX_NR_ZONES];
> Why do we need an extra data structure.  What's wrong with putting
> per-zone data in ... 'struct zone'?
Will it be acceptable to add fields in struct zone, when they will only
be used by page hinting?
>   The cover letter claims that it
> doesn't touch core-mm infrastructure, but if it depends on mechanisms
> like this, I think that's a very bad thing.
>
> To be honest, I'm not sure this series is worth reviewing at this point.
> It's horribly lightly commented and full of kernel antipatterns like
>
> void func()
> {
> 	if () {
> 		... indent entire logic
> 		... of function
> 	}
> }
I usually run checkpatch to detect such indentation issues. For the
patches I shared, it didn't show me any issues.
>
> It has big "TODO"s.  It's virtually comment-free.  I'm shocked it's at
> the 11th version and still looking like this.
>
>> +
>> +		for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
>> +			unsigned long pages = free_area[zone_idx].end_pfn -
>> +					free_area[zone_idx].base_pfn;
>> +			bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1;
>> +			if (!bitmap_size)
>> +				continue;
>> +			free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size,
>> +								   GFP_KERNEL);
> This doesn't support sparse zones.  We can have zones with massive
> spanned page sizes, but very few present pages.  On those zones, this
> will exhaust memory for no good reason.
>
> Comparing this to Alex's patch set, it's of much lower quality and at a
> much earlier stage of development.  The two sets are not really even
> comparable right now.  This certainly doesn't sell me on (or even really
> enumerate the deltas in) this approach vs. Alex's.
>
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 15:25     ` Nitesh Narayan Lal
@ 2019-07-11 15:50       ` Nitesh Narayan Lal
  2019-07-11 16:22       ` Dave Hansen
  1 sibling, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 15:50 UTC (permalink / raw)
  To: Dave Hansen, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	mhocko


On 7/11/19 11:25 AM, Nitesh Narayan Lal wrote:
> On 7/10/19 4:45 PM, Dave Hansen wrote:
>> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>>> +struct zone_free_area {
>>> +	unsigned long *bitmap;
>>> +	unsigned long base_pfn;
>>> +	unsigned long end_pfn;
>>> +	atomic_t free_pages;
>>> +	unsigned long nbits;
>>> +} free_area[MAX_NR_ZONES];
>> Why do we need an extra data structure.  What's wrong with putting
>> per-zone data in ... 'struct zone'?
> Will it be acceptable to add fields in struct zone, when they will only
> be used by page hinting?
>>   The cover letter claims that it
>> doesn't touch core-mm infrastructure, but if it depends on mechanisms
>> like this, I think that's a very bad thing.
>>
>> To be honest, I'm not sure this series is worth reviewing at this point.
>>  It's horribly lightly commented and full of kernel antipatterns like
>>
>> void func()
>> {
>> 	if () {
>> 		... indent entire logic
>> 		... of function
>> 	}
>> }
> I usually run checkpatch to detect such indentation issues. For the
> patches I shared, it didn't show me any issues.
My bad, I think I jumped the gun here; I see what you are referring to.
I will fix these kinds of things.
>> [...snip...]
-- 
Thanks
Nitesh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 15:25     ` Nitesh Narayan Lal
  2019-07-11 15:50       ` Nitesh Narayan Lal
@ 2019-07-11 16:22       ` Dave Hansen
  2019-07-11 16:36         ` Nitesh Narayan Lal
  1 sibling, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2019-07-11 16:22 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 7/11/19 8:25 AM, Nitesh Narayan Lal wrote:
> On 7/10/19 4:45 PM, Dave Hansen wrote:
>> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>>> +struct zone_free_area {
>>> +	unsigned long *bitmap;
>>> +	unsigned long base_pfn;
>>> +	unsigned long end_pfn;
>>> +	atomic_t free_pages;
>>> +	unsigned long nbits;
>>> +} free_area[MAX_NR_ZONES];
>> Why do we need an extra data structure.  What's wrong with putting
>> per-zone data in ... 'struct zone'?
> Will it be acceptable to add fields in struct zone, when they will only
> be used by page hinting?

Wait a sec...  MAX_NR_ZONES is the number of zone types, not the maximum
number of *zones* in the system.

Did you test this on a NUMA system?

In any case, yes, you can put these in 'struct zone'.  It will waste
less space that way, on average, than what you have here (once you scale
it to MAX_NR_ZONES * MAX_NUM_NODES).
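
For illustration, a rough sketch of what that could look like (field
names here are made up, not from any posted patch). The base/end PFNs
would not need to be duplicated, since each 'struct zone' instance is
already per-node and per-type and carries zone->zone_start_pfn and its
span:

/* include/linux/mmzone.h (sketch only) */
struct zone {
	/* ...existing fields... */
#ifdef CONFIG_PAGE_HINTING
	/* one bit per PAGE_HINTING_MIN_ORDER chunk spanned by this zone */
	unsigned long	*hint_bitmap;
	/* number of bits allocated in hint_bitmap */
	unsigned long	 hint_nbits;
	/* free chunks tracked; used to decide when to kick the worker */
	atomic_t	 hint_free_pages;
#endif
	/* ...existing fields... */
};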

>>   The cover letter claims that it
>> doesn't touch core-mm infrastructure, but if it depends on mechanisms
>> like this, I think that's a very bad thing.
>>
>> To be honest, I'm not sure this series is worth reviewing at this point.
>>  It's horribly lightly commented and full of kernel antipatterns like
>>
>> void func()
>> {
>> 	if () {
>> 		... indent entire logic
>> 		... of function
>> 	}
>> }
> I usually run checkpatch to detect such indentation issues. For the
> patches I shared, it didn't show me any issues.

Just because checkpatch doesn't complain does not mean it is good form.
 We write the above as:

void func()
{
	if (!something)
		goto out;

	... logic of function here
out:
	// cleanup
}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 16:22       ` Dave Hansen
@ 2019-07-11 16:36         ` Nitesh Narayan Lal
  2019-07-11 16:45           ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 16:36 UTC (permalink / raw)
  To: Dave Hansen, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	mhocko


On 7/11/19 12:22 PM, Dave Hansen wrote:
> On 7/11/19 8:25 AM, Nitesh Narayan Lal wrote:
>> On 7/10/19 4:45 PM, Dave Hansen wrote:
>>> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>>>> +struct zone_free_area {
>>>> +	unsigned long *bitmap;
>>>> +	unsigned long base_pfn;
>>>> +	unsigned long end_pfn;
>>>> +	atomic_t free_pages;
>>>> +	unsigned long nbits;
>>>> +} free_area[MAX_NR_ZONES];
>>> Why do we need an extra data structure.  What's wrong with putting
>>> per-zone data in ... 'struct zone'?
>> Will it be acceptable to add fields in struct zone, when they will only
>> be used by page hinting?
> Wait a sec...  MAX_NR_ZONES is the number of zone types, not the maximum
> number of *zones* in the system.
>
> Did you test this on a NUMA system?
Yes, I tested it with a guest having 2 and 3 NUMA nodes.
> In any case, yes, you can put these in 'struct zone'.  It will waste
> less space that way, on average, than what you have here (once you scale
> it to MAX_NR_ZONES * MAX_NUM_NODES).
>>>   The cover letter claims that it
>>> doesn't touch core-mm infrastructure, but if it depends on mechanisms
>>> like this, I think that's a very bad thing.
>>>
>>> To be honest, I'm not sure this series is worth reviewing at this point.
>>>  It's horribly lightly commented and full of kernel antipatterns like
>>>
>>> void func()
>>> {
>>> 	if () {
>>> 		... indent entire logic
>>> 		... of function
>>> 	}
>>> }
>> I usually run checkpatch to detect such indentation issues. For the
>> patches I shared, it didn't show me any issues.
> Just because checkpatch doesn't complain does not mean it is good form.
>  We write the above as:
>
> void func()
> {
> 	if (!something)
> 		goto out;
>
> 	... logic of function here
> out:
> 	// cleanup
> }

Yeap, I got it. I will correct this.


-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 16:36         ` Nitesh Narayan Lal
@ 2019-07-11 16:45           ` Dave Hansen
  2019-07-11 16:52             ` Nitesh Narayan Lal
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2019-07-11 16:45 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 7/11/19 9:36 AM, Nitesh Narayan Lal wrote:
>>>>> +struct zone_free_area {
>>>>> +	unsigned long *bitmap;
>>>>> +	unsigned long base_pfn;
>>>>> +	unsigned long end_pfn;
>>>>> +	atomic_t free_pages;
>>>>> +	unsigned long nbits;
>>>>> +} free_area[MAX_NR_ZONES];
>>>> Why do we need an extra data structure.  What's wrong with putting
>>>> per-zone data in ... 'struct zone'?
>>> Will it be acceptable to add fields in struct zone, when they will only
>>> be used by page hinting?
>> Wait a sec...  MAX_NR_ZONES is the number of zone types, not the maximum
>> number of *zones* in the system.
>>
>> Did you test this on a NUMA system?
> Yes, I tested it with a guest having 2 and 3 NUMA nodes.

How can this *possibly* have worked?

Won't each same-typed zone just use the same free_area[] entry since
zone_idx(zone1)==zone_idx(zone2) if zone1 and zone2 are (for example)
both ZONE_NORMAL?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 16:45           ` Dave Hansen
@ 2019-07-11 16:52             ` Nitesh Narayan Lal
  0 siblings, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 16:52 UTC (permalink / raw)
  To: Dave Hansen, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino,
	pagupta, wei.w.wang, yang.zhang.wz, riel, david, mst, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck, john.starks,
	mhocko


On 7/11/19 12:45 PM, Dave Hansen wrote:
> On 7/11/19 9:36 AM, Nitesh Narayan Lal wrote:
>>>>>> +struct zone_free_area {
>>>>>> +	unsigned long *bitmap;
>>>>>> +	unsigned long base_pfn;
>>>>>> +	unsigned long end_pfn;
>>>>>> +	atomic_t free_pages;
>>>>>> +	unsigned long nbits;
>>>>>> +} free_area[MAX_NR_ZONES];
>>>>> Why do we need an extra data structure.  What's wrong with putting
>>>>> per-zone data in ... 'struct zone'?
>>>> Will it be acceptable to add fields in struct zone, when they will only
>>>> be used by page hinting?
>>> Wait a sec...  MAX_NR_ZONES is the number of zone types, not the maximum
>>> number of *zones* in the system.
>>>
>>> Did you test this on a NUMA system?
>> Yes, I tested it with a guest having 2 and 3 NUMA nodes.
> How can this *possibly* have worked?
>
> Won't each same-typed zone just use the same free_area[] entry since
> zone_idx(zone1)==zone_idx(zone2) if zone1 and zone2 are (for example)
> both ZONE_NORMAL?
Yes. However, the base_pfn and end_pfn will be updated with zone1's
base_pfn and zone2's end_pfn value from page_hinting_enable(). Isn't it?
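
To illustrate with hypothetical PFN ranges (a sketch of what
page_hinting_enable() computes, not actual values):

	/* node 0: ZONE_NORMAL spans PFNs 0x080000 - 0x100000 */
	/* node 1: ZONE_NORMAL spans PFNs 0x180000 - 0x200000 */
	free_area[ZONE_NORMAL].base_pfn = 0x080000;	/* min of bases */
	free_area[ZONE_NORMAL].end_pfn  = 0x200000;	/* max of ends  */

	/* a PFN from either node then maps to a unique bit ... */
	bitnr = (pfn - free_area[ZONE_NORMAL].base_pfn)
			>> PAGE_HINTING_MIN_ORDER;

	/* ... at the cost of also allocating bits for the
	 * 0x100000 - 0x180000 hole between the two nodes.
	 */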

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH v11 0/2] mm: Support for page hinting
  2019-07-11 15:19           ` Nitesh Narayan Lal
@ 2019-07-11 17:01             ` Alexander Duyck
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Duyck @ 2019-07-11 17:01 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Thu, Jul 11, 2019 at 8:19 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> [...snip...]
> > One other thing you can probably do to limit the variability would be
> > to look at disabling any power management features on the system. One
> > thing you could be seeing is the effect of the CPU enabling turbo mode
> > or going into sleep states if idle. That can easily throw the numbers
> > around quite a bit.
> The variability you mentioned, was it after disabling these options?

It wasn't entirely eliminated, but it was reduced. It also reduces
the overall performance for the lower thread counts, though, as it
drops single-thread performance by 10% or more.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 21:56   ` Alexander Duyck
@ 2019-07-11 17:58     ` Nitesh Narayan Lal
  2019-07-11 23:20       ` Alexander Duyck
  0 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 17:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/10/19 5:56 PM, Alexander Duyck wrote:
> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page hinting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_FREE), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The number of pages removed
>> temporarily from the buddy is governed by the backend (virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can be hinted to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> There are still various things to look into (e.g., memory hotplug, more
>> efficient locking, possible races when disabling).
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  include/linux/page_hinting.h |  45 +++++++
>>  mm/Kconfig                   |   6 +
>>  mm/Makefile                  |   1 +
>>  mm/page_alloc.c              |  18 +--
>>  mm/page_hinting.c            | 250 +++++++++++++++++++++++++++++++++++
>>  5 files changed, 312 insertions(+), 8 deletions(-)
>>  create mode 100644 include/linux/page_hinting.h
>>  create mode 100644 mm/page_hinting.c
>>
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..4900feb796f9
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,45 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_HINTING_H
>> +#define _LINUX_PAGE_HINTING_H
>> +
>> +/*
>> + * Minimum page order required for a page to be hinted to the host.
>> + */
>> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
>> +
> Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed
> out earlier, or is it due to something else? I'm just wondering if
> this will have an impact on architectures outside of x86, as I had
> chosen pageblock_order, which happened to be MAX_ORDER - 2 on x86, but I
> don't know what the impact of doing that is on other architectures
> versus the (MAX_ORDER - 2) approach you took here.
If I am not wrong, then any order < (MAX_ORDER - 2) would break up THP
in the hypervisor. That's one reason we decided to stick with this.
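
For reference, the arithmetic behind that on x86-64 (a sketch, assuming
4K base pages):

	/*
	 * MAX_ORDER = 11, so PAGE_HINTING_MIN_ORDER = MAX_ORDER - 2 = 9.
	 * A 2MB THP covers 2MB / 4KB = 512 base pages, i.e. an order-9
	 * block, and order-9 buddy pages are naturally 2MB-aligned. So
	 * every hinted chunk maps onto at least one whole potential huge
	 * page in the host, and the hint never forces a THP split there.
	 */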
>
>> +/*
>> + * struct page_hinting_config: holds the information supplied by the balloon
>> + * device to page hinting.
>> + * @hint_pages:                Callback which reports the isolated pages
>> + *                     synchornously to the host.
>> + *                     synchronously to the host.
>> + * @max_pages:         Maximum pages that are going to be hinted to the host
>> + */
>> +struct page_hinting_config {
>> +       void (*hint_pages)(struct list_head *list);
>> +       int max_pages;
>> +};
>> +
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>> +                           struct zone *zone, unsigned int order,
>> +                           int migratetype, bool hint);
>> +#ifdef CONFIG_PAGE_HINTING
>> +void page_hinting_enqueue(struct page *page, int order);
>> +int page_hinting_enable(const struct page_hinting_config *conf);
>> +void page_hinting_disable(void);
>> +#else
>> +static inline void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline int page_hinting_enable(const struct page_hinting_config *conf)
>> +{
>> +       return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void page_hinting_disable(void)
>> +{
>> +}
>> +#endif
>> +#endif /* _LINUX_PAGE_HINTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index f0c76ba47695..e97fab429d9b 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -765,4 +765,10 @@ config GUP_BENCHMARK
>>  config ARCH_HAS_PTE_SPECIAL
>>         bool
>>
>> +# PAGE_HINTING will allow the guest to report the free pages to the
>> +# host in fixed chunks as soon as the threshold is reached.
>> +config PAGE_HINTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
> If there are no issue with using the term "PAGE_HINTING" I guess I
> will update my patch set to use that term instead of aeration.
Not sure; in places like virtio_balloon we may have to think of
something else to avoid any confusion.
>
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ac5e5ba78874..73be49177656 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -94,6 +94,7 @@ obj-$(CONFIG_Z3FOLD)  += z3fold.o
>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>  obj-$(CONFIG_CMA)      += cma.o
>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d66bc8abe0af..8a44338bd04e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -69,6 +69,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_hinting.h>
>>
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -874,10 +875,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>   * -- nyc
>>   */
>>
>> -static inline void __free_one_page(struct page *page,
>> +inline void __free_one_page(struct page *page,
>>                 unsigned long pfn,
>>                 struct zone *zone, unsigned int order,
>> -               int migratetype)
>> +               int migratetype, bool hint)
>>  {
>>         unsigned long combined_pfn;
>>         unsigned long uninitialized_var(buddy_pfn);
>> @@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page,
>>                                 migratetype);
>>         else
>>                 add_to_free_area(page, &zone->free_area[order], migratetype);
>> -
>> +       if (hint)
>> +               page_hinting_enqueue(page, order);
>>  }
> I'm not sure I am a fan of the way the word "hint" is used here. At
> first I thought this was supposed to be !hint since I thought hint
> meant that it was a hinted page, not that we need to record that this
> page has been freed. Maybe "record" or "report" might be a better word
> to use here.
"hint" basically means that the page is supposed to be hinted.
>>  /*
>> @@ -1263,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>                 if (unlikely(isolated_pageblocks))
>>                         mt = get_pageblock_migratetype(page);
>>
>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>         }
>>         spin_unlock(&zone->lock);
>> @@ -1272,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>                                 struct page *page, unsigned long pfn,
>>                                 unsigned int order,
>> -                               int migratetype)
>> +                               int migratetype, bool hint)
>>  {
>>         spin_lock(&zone->lock);
>>         if (unlikely(has_isolate_pageblock(zone) ||
>>                 is_migrate_isolate(migratetype))) {
>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>         }
>> -       __free_one_page(page, pfn, zone, order, migratetype);
>> +       __free_one_page(page, pfn, zone, order, migratetype, hint);
>>         spin_unlock(&zone->lock);
>>  }
>>
>> @@ -1369,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>         local_irq_save(flags);
>>         __count_vm_events(PGFREE, 1 << order);
>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>         local_irq_restore(flags);
>>  }
>>
>> @@ -2969,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>          */
>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>                         return;
>>                 }
>>                 migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>> new file mode 100644
>> index 000000000000..0bfa09f8c3ed
>> --- /dev/null
>> +++ b/mm/page_hinting.c
>> @@ -0,0 +1,250 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page hinting core infrastructure to enable a VM to report free pages to its
>> + * hypervisor.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_hinting.h>
>> +#include <linux/kvm_host.h>
>> +
>> +/*
>> + * struct zone_free_area: For a single zone across NUMA nodes, it holds the
>> + * bitmap pointer to track the free pages and other required parameters
>> + * used to recover these pages by scanning the bitmap.
>> + * @bitmap:            Pointer to the bitmap in PAGE_HINTING_MIN_ORDER
>> + *                     granularity.
>> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
>> + * @end_pfn:           Indicates the last PFN value for the zone.
>> + * @free_pages:                Tracks the number of free pages of granularity
>> + *                     PAGE_HINTING_MIN_ORDER.
>> + * @nbits:             Indicates the total size of the bitmap in bits allocated
>> + *                     at the time of initialization.
>> + */
>> +struct zone_free_area {
>> +       unsigned long *bitmap;
>> +       unsigned long base_pfn;
>> +       unsigned long end_pfn;
>> +       atomic_t free_pages;
>> +       unsigned long nbits;
>> +} free_area[MAX_NR_ZONES];
>> +
> You still haven't addressed the NUMA issue I pointed out with v10. You
> are only able to address the first set of zones with this setup. As
> such you can end up missing large sections of memory if it is split
> over multiple nodes.
I think I did.
>
>> +static void init_hinting_wq(struct work_struct *work);
>> +static DEFINE_MUTEX(page_hinting_init);
>> +const struct page_hinting_config *page_hitning_conf;
>> +struct work_struct hinting_work;
>> +atomic_t page_hinting_active;
>> +
>> +void free_area_cleanup(int nr_zones)
>> +{
> I'm not sure why you are passing nr_zones as an argument here. Won't
> this always be MAX_NR_ZONES?
free_area_cleanup() gets called from page_hinting_disable() and
page_hinting_enable(). In page_hinting_enable(), when the allocation
fails, we may not have to perform cleanup for all the zones every time.
>
>> +       int zone_idx;
>> +
>> +       for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) {
>> +               bitmap_free(free_area[zone_idx].bitmap);
>> +               free_area[zone_idx].base_pfn = 0;
>> +               free_area[zone_idx].end_pfn = 0;
>> +               free_area[zone_idx].nbits = 0;
>> +               atomic_set(&free_area[zone_idx].free_pages, 0);
>> +       }
>> +}
>> +
>> +int page_hinting_enable(const struct page_hinting_config *conf)
>> +{
>> +       unsigned long bitmap_size = 0;
>> +       int zone_idx = 0, ret = -EBUSY;
>> +       struct zone *zone;
>> +
>> +       mutex_lock(&page_hinting_init);
>> +       if (!page_hitning_conf) {
>> +               for_each_populated_zone(zone) {
> So for_each_populated_zone will go through all of the NUMA nodes. So
> if I am not mistaken you will overwrite the free_area values of all
> the previous nodes with the last node in the system.
Not sure if I understood.
>  So if we have a
> setup that has all the memory in the first node, and none in the
> second it would effectively disable free page hinting would it not?
Why would that happen? The base_pfn will still be pointing to the
base_pfn of the first node. Isn't it?
>
>> +                       zone_idx = zone_idx(zone);
>> +#ifdef CONFIG_ZONE_DEVICE
>> +                       if (zone_idx == ZONE_DEVICE)
>> +                               continue;
>> +#endif
>> +                       spin_lock(&zone->lock);
>> +                       if (free_area[zone_idx].base_pfn) {
>> +                               free_area[zone_idx].base_pfn =
>> +                                       min(free_area[zone_idx].base_pfn,
>> +                                           zone->zone_start_pfn);
>> +                               free_area[zone_idx].end_pfn =
>> +                                       max(free_area[zone_idx].end_pfn,
>> +                                           zone->zone_start_pfn +
>> +                                           zone->spanned_pages);
>> +                       } else {
>> +                               free_area[zone_idx].base_pfn =
>> +                                       zone->zone_start_pfn;
>> +                               free_area[zone_idx].end_pfn =
>> +                                       zone->zone_start_pfn +
>> +                                       zone->spanned_pages;
>> +                       }
>> +                       spin_unlock(&zone->lock);
>> +               }
>> +
>> +               for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
>> +                       unsigned long pages = free_area[zone_idx].end_pfn -
>> +                                       free_area[zone_idx].base_pfn;
>> +                       bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1;
>> +                       if (!bitmap_size)
>> +                               continue;
>> +                       free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size,
>> +                                                                  GFP_KERNEL);
>> +                       if (!free_area[zone_idx].bitmap) {
>> +                               free_area_cleanup(zone_idx);
>> +                               mutex_unlock(&page_hinting_init);
>> +                               return -ENOMEM;
>> +                       }
>> +                       free_area[zone_idx].nbits = bitmap_size;
>> +               }
> So this is the bit that still needs to address hotplug, right?
Yes, hotplug still needs to be addressed.
> I would
> imagine you need to reallocate this if the spanned_pages range changes
> correct?
>
>> +               page_hitning_conf = conf;
>> +               INIT_WORK(&hinting_work, init_hinting_wq);
>> +               ret = 0;
>> +       }
>> +       mutex_unlock(&page_hinting_init);
>> +       return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>> +
>> +void page_hinting_disable(void)
>> +{
>> +       cancel_work_sync(&hinting_work);
>> +       page_hitning_conf = NULL;
>> +       free_area_cleanup(MAX_NR_ZONES);
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>> +
>> +static unsigned long pfn_to_bit(struct page *page, int zone_idx)
>> +{
>> +       unsigned long bitnr;
>> +
>> +       bitnr = (page_to_pfn(page) - free_area[zone_idx].base_pfn)
>> +                        >> PAGE_HINTING_MIN_ORDER;
>> +       return bitnr;
>> +}
>> +
>> +static void release_buddy_pages(struct list_head *pages)
>> +{
>> +       int mt = 0, zone_idx, order;
>> +       struct page *page, *next;
>> +       unsigned long bitnr;
>> +       struct zone *zone;
>> +
>> +       list_for_each_entry_safe(page, next, pages, lru) {
>> +               zone_idx = page_zonenum(page);
>> +               zone = page_zone(page);
>> +               bitnr = pfn_to_bit(page, zone_idx);
>> +               spin_lock(&zone->lock);
>> +               list_del(&page->lru);
>> +               order = page_private(page);
>> +               set_page_private(page, 0);
>> +               mt = get_pageblock_migratetype(page);
>> +               __free_one_page(page, page_to_pfn(page), zone,
>> +                               order, mt, false);
>> +               spin_unlock(&zone->lock);
>> +       }
>> +}
>> +
>> +static void bm_set_pfn(struct page *page)
>> +{
>> +       struct zone *zone = page_zone(page);
>> +       int zone_idx = page_zonenum(page);
>> +       unsigned long bitnr = 0;
>> +
>> +       lockdep_assert_held(&zone->lock);
>> +       bitnr = pfn_to_bit(page, zone_idx);
>> +       /*
>> +        * TODO: fix possible underflows.
>> +        */
>> +       if (free_area[zone_idx].bitmap &&
>> +           bitnr < free_area[zone_idx].nbits &&
>> +           !test_and_set_bit(bitnr, free_area[zone_idx].bitmap))
>> +               atomic_inc(&free_area[zone_idx].free_pages);
>> +}
>> +
>> +static void scan_zone_free_area(int zone_idx, int free_pages)
>> +{
>> +       int ret = 0, order, isolated_cnt = 0;
>> +       unsigned long set_bit, start = 0;
>> +       LIST_HEAD(isolated_pages);
>> +       struct page *page;
>> +       struct zone *zone;
>> +
>> +       for (;;) {
>> +               ret = 0;
>> +               set_bit = find_next_bit(free_area[zone_idx].bitmap,
>> +                                       free_area[zone_idx].nbits, start);
>> +               if (set_bit >= free_area[zone_idx].nbits)
>> +                       break;
>> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>> +                               free_area[zone_idx].base_pfn);
>> +               if (!page)
>> +                       continue;
>> +               zone = page_zone(page);
>> +               spin_lock(&zone->lock);
>> +
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_HINTING_MIN_ORDER) {
>> +                       order = page_private(page);
>> +                       ret = __isolate_free_page(page, order);
>> +               }
>> +               clear_bit(set_bit, free_area[zone_idx].bitmap);
>> +               atomic_dec(&free_area[zone_idx].free_pages);
>> +               spin_unlock(&zone->lock);
>> +               if (ret) {
>> +                       /*
>> +                        * restoring page order to use it while releasing
>> +                        * the pages back to the buddy.
>> +                        */
>> +                       set_page_private(page, order);
>> +                       list_add_tail(&page->lru, &isolated_pages);
>> +                       isolated_cnt++;
>> +                       if (isolated_cnt == page_hitning_conf->max_pages) {
>> +                               page_hitning_conf->hint_pages(&isolated_pages);
>> +                               release_buddy_pages(&isolated_pages);
>> +                               isolated_cnt = 0;
>> +                       }
>> +               }
>> +               start = set_bit + 1;
>> +       }
>> +       if (isolated_cnt) {
>> +               page_hitning_conf->hint_pages(&isolated_pages);
>> +               release_buddy_pages(&isolated_pages);
>> +       }
>> +}
>> +
> I really worry that this loop is going to become more expensive as the
> size of memory increases. For example if we hint on just 16 pages we
> would have to walk something like 32K bits, 512 longs, if a system had
> 64G of memory. Have you considered testing with a larger memory
> footprint to see if it has an impact on performance?
I am hoping this will be noticeable in will-it-scale's page_fault1 if I
run it on a larger system.
>
>> +static void init_hinting_wq(struct work_struct *work)
>> +{
>> +       int zone_idx, free_pages;
>> +
>> +       atomic_set(&page_hinting_active, 1);
>> +       for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
>> +               free_pages = atomic_read(&free_area[zone_idx].free_pages);
>> +               if (free_pages >= page_hitning_conf->max_pages)
>> +                       scan_zone_free_area(zone_idx, free_pages);
>> +       }
>> +       atomic_set(&page_hinting_active, 0);
>> +}
>> +
>> +void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +       int zone_idx;
>> +
>> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
>> +               return;
> I would think it is going to be expensive to be jumping into this
> function for every freed page. You should probably have an inline
> taking care of the order check before you even get here since it would
> be faster that way.
I see, I can take a look. Thanks.
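Something along these lines (untested, the helper name is just
illustrative) in page_hinting.h should avoid the function call for the
common low-order frees:

static inline void page_hinting_notify_free(struct page *page, int order)
{
	/* Filter out the common low-order frees before making the call. */
	if (order >= PAGE_HINTING_MIN_ORDER)
		page_hinting_enqueue(page, order);
}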
>
>> +
>> +       bm_set_pfn(page);
>> +       if (atomic_read(&page_hinting_active))
>> +               return;
> So I would think this piece is racy. Specifically, if you set a bit
> for a PFN that is somewhere below the PFN you are currently processing
> in your scan, it is going to sit there unprocessed until another page
> is freed after the scan is completed. I would worry you can end up with
> a batch free of memory resulting in a group of pages sitting at the
> start of your bitmap unhinted.
True, but those pages will be hinted the next time the threshold is met.
>
> In my patches I resolved this by looping through all of the zones;
> however, your approach is missing the necessary pieces to make that
> safe, as you could end up in a soft lockup with the scanning thread
> spinning on a noisy system.
>
>> +       zone_idx = zone_idx(page_zone(page));
>> +       if (atomic_read(&free_area[zone_idx].free_pages) >=
>> +                       page_hitning_conf->max_pages) {
>> +               int cpu = smp_processor_id();
>> +
>> +               queue_work_on(cpu, system_wq, &hinting_work);
>> +       }
>> +}
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 19:51 ` [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
  2019-07-10 20:45   ` Dave Hansen
  2019-07-10 21:56   ` Alexander Duyck
@ 2019-07-11 18:21   ` Dave Hansen
  2019-07-15  9:33     ` David Hildenbrand
  2 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2019-07-11 18:21 UTC (permalink / raw)
  To: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, david,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
> +static void bm_set_pfn(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +	int zone_idx = page_zonenum(page);
> +	unsigned long bitnr = 0;
> +
> +	lockdep_assert_held(&zone->lock);
> +	bitnr = pfn_to_bit(page, zone_idx);
> +	/*
> +	 * TODO: fix possible underflows.
> +	 */
> +	if (free_area[zone_idx].bitmap &&
> +	    bitnr < free_area[zone_idx].nbits &&
> +	    !test_and_set_bit(bitnr, free_area[zone_idx].bitmap))
> +		atomic_inc(&free_area[zone_idx].free_pages);
> +}

Let's say I have two NUMA nodes, each with ZONE_NORMAL and ZONE_MOVABLE
and each zone with 1GB of memory:

Node:         0        1
NORMAL   0->1GB   2->3GB
MOVABLE  1->2GB   3->4GB

This code will allocate two bitmaps.  The ZONE_NORMAL bitmap will
represent data from 0->3GB and the ZONE_MOVABLE bitmap will represent
data from 1->4GB.  That's the result of this code:

> +			if (free_area[zone_idx].base_pfn) {
> +				free_area[zone_idx].base_pfn =
> +					min(free_area[zone_idx].base_pfn,
> +					    zone->zone_start_pfn);
> +				free_area[zone_idx].end_pfn =
> +					max(free_area[zone_idx].end_pfn,
> +					    zone->zone_start_pfn +
> +					    zone->spanned_pages);

But that means that both bitmaps will have space for PFNs in the other
zone type, which is completely bogus.  This is fundamental because the
data structures are incorrectly built per zone *type* instead of per zone.
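A truly per-zone layout would have to key off both the node and the zone
index, for instance (rough sketch, with free_area[] sized
MAX_NUMNODES * MAX_NR_ZONES):

	/* one bitmap per (node, zone) pair */
	int idx = page_to_nid(page) * MAX_NR_ZONES + page_zonenum(page);
	struct zone_free_area *fa = &free_area[idx];

so that each of the four 1GB zones above gets its own bitmap.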


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [QEMU Patch] virtio-baloon: Support for page hinting
  2019-07-10 19:53 ` [QEMU Patch] virtio-baloon: Support for page hinting Nitesh Narayan Lal
  2019-07-10 20:17   ` Alexander Duyck
  2019-07-11  8:49   ` Cornelia Huck
@ 2019-07-11 18:55   ` Michael S. Tsirkin
  2019-07-11 19:06     ` Nitesh Narayan Lal
  2 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2019-07-11 18:55 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko

On Wed, Jul 10, 2019 at 03:53:03PM -0400, Nitesh Narayan Lal wrote:
> Enables QEMU to perform madvise free on the memory range reported
> by the vm.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

Missing second "l" in the subject :)

> ---
>  hw/virtio/trace-events                        |  1 +
>  hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
>  include/hw/virtio/virtio-balloon.h            |  2 +-
>  include/qemu/osdep.h                          |  7 +++
>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>  5 files changed, 69 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index e28ba48da6..f703a22d36 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request PFN:%lu size: %d"
>  
>  # virtio-mmio.c
>  virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 2112874055..5d186707b5 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -34,6 +34,9 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES	16
> +void free_mem_range(uint64_t addr, uint64_t len);
> +
>  struct PartiallyBalloonedPage {
>      RAMBlock *rb;
>      ram_addr_t base;
> @@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>      balloon_stats_change_timer(s, 0);
>  }
>  
> +void free_mem_range(uint64_t addr, uint64_t len)
> +{
> +    int ret = 0;
> +    void *hvaddr_to_free;
> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
> +                                                 addr, 1);
> +    if (!mrs.mr) {
> +	warn_report("%s:No memory is mapped at address 0x%lu", __func__, addr);
> +        return;
> +    }
> +
> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
> +	warn_report("%s:Memory at address 0x%s is not RAM:0x%lu", __func__,
> +		    HWADDR_PRIx, addr);
> +        memory_region_unref(mrs.mr);
> +        return;
> +    }
> +
> +    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
> +    trace_virtio_balloon_hinting_request(addr, len);
> +    ret = qemu_madvise(hvaddr_to_free,len, QEMU_MADV_FREE);
> +    if (ret == -1) {
> +	warn_report("%s: Madvise failed with error:%d", __func__, ret);
> +    }
> +}
> +
> +static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
> +					       VirtQueue *vq)
> +{
> +    VirtQueueElement *elem;
> +    size_t offset = 0;
> +    uint64_t gpa, len;
> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +    if (!elem) {
> +        return;
> +    }
> +    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
> +     * only read the buffer which holds a valid PFN value.
> +     * TODO: Find a better way to do this.

Indeed. In fact, what is wrong with passing the gpa as
part of the element itself?

> +     */
> +    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
> +	offset += 8;
> +	offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);
> +	if (!qemu_balloon_is_inhibited()) {
> +	    free_mem_range(gpa, len);
> +	}
> +    }
> +    virtqueue_push(vq, elem, offset);
> +    virtio_notify(vdev, vq);
> +    g_free(elem);
> +}
> +
>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> @@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>      f |= dev->host_features;
>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
>  
>      return f;
>  }
> @@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);
>  
>      if (virtio_has_feature(s->host_features,
>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> @@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
>  
>      object_property_add(obj, "guest-stats", "guest statistics",
>                          balloon_stats_get_all, NULL, NULL, s, NULL);
> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
> +                        NULL, NULL, NULL, s, NULL);
>  
>      object_property_add(obj, "guest-stats-polling-interval", "int",
>                          balloon_stats_get_poll_interval,
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 1afafb12f6..a58b24fdf2 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {
>  
>  typedef struct VirtIOBalloon {
>      VirtIODevice parent_obj;
> -    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
> +    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *hvq;
>      uint32_t free_page_report_status;
>      uint32_t num_pages;
>      uint32_t actual;
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index af2b91f0b8..bb9207e7f4 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #else
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>  #endif
> +#ifdef MADV_FREE
> +#define QEMU_MADV_FREE MADV_FREE
> +#else
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> +#endif
>  
>  #elif defined(CONFIG_POSIX_MADVISE)
>  
> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>  
>  #else /* no-op */
>  
> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>  
>  #endif
>  
> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> index 9375ca2a70..f9e3e82562 100644
> --- a/include/standard-headers/linux/virtio_balloon.h
> +++ b/include/standard-headers/linux/virtio_balloon.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.21.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [QEMU Patch] virtio-baloon: Support for page hinting
  2019-07-11 18:55   ` Michael S. Tsirkin
@ 2019-07-11 19:06     ` Nitesh Narayan Lal
  2019-07-11 22:36       ` Alexander Duyck
  0 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-11 19:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko


On 7/11/19 2:55 PM, Michael S. Tsirkin wrote:
> On Wed, Jul 10, 2019 at 03:53:03PM -0400, Nitesh Narayan Lal wrote:
>> Enables QEMU to perform madvise free on the memory range reported
>> by the vm.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> Missing second "l" in the subject :)
>
>> ---
>>  hw/virtio/trace-events                        |  1 +
>>  hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
>>  include/hw/virtio/virtio-balloon.h            |  2 +-
>>  include/qemu/osdep.h                          |  7 +++
>>  .../standard-headers/linux/virtio_balloon.h   |  1 +
>>  5 files changed, 69 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>> index e28ba48da6..f703a22d36 100644
>> --- a/hw/virtio/trace-events
>> +++ b/hw/virtio/trace-events
>> @@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
>> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request PFN:%lu size: %d"
>>  
>>  # virtio-mmio.c
>>  virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 2112874055..5d186707b5 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -34,6 +34,9 @@
>>  
>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>  
>> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES	16
>> +void free_mem_range(uint64_t addr, uint64_t len);
>> +
>>  struct PartiallyBalloonedPage {
>>      RAMBlock *rb;
>>      ram_addr_t base;
>> @@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
>>      balloon_stats_change_timer(s, 0);
>>  }
>>  
>> +void free_mem_range(uint64_t addr, uint64_t len)
>> +{
>> +    int ret = 0;
>> +    void *hvaddr_to_free;
>> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
>> +                                                 addr, 1);
>> +    if (!mrs.mr) {
>> +	warn_report("%s:No memory is mapped at address 0x%lu", __func__, addr);
>> +        return;
>> +    }
>> +
>> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
>> +	warn_report("%s:Memory at address 0x%s is not RAM:0x%lu", __func__,
>> +		    HWADDR_PRIx, addr);
>> +        memory_region_unref(mrs.mr);
>> +        return;
>> +    }
>> +
>> +    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
>> +    trace_virtio_balloon_hinting_request(addr, len);
>> +    ret = qemu_madvise(hvaddr_to_free,len, QEMU_MADV_FREE);
>> +    if (ret == -1) {
>> +	warn_report("%s: Madvise failed with error:%d", __func__, ret);
>> +    }
>> +}
>> +
>> +static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
>> +					       VirtQueue *vq)
>> +{
>> +    VirtQueueElement *elem;
>> +    size_t offset = 0;
>> +    uint64_t gpa, len;
>> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>> +    if (!elem) {
>> +        return;
>> +    }
>> +    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
>> +     * only read the buffer which holds a valid PFN value.
>> +     * TODO: Find a better way to do this.
> Indeed. In fact, what is wrong with passing the gpa as
> part of the element itself?
There are two values which I need to read: 'gpa' and 'len'. I will have
to check how to pass them both as part of the element, but I will look
into it.
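One (hypothetical) option could be a fixed per-entry layout, e.g.:

struct virtio_balloon_hint {	/* hypothetical wire format */
	__virtio64 gpa;		/* guest physical address of the range */
	__virtio64 len;		/* length of the range in bytes */
};

so that each descriptor carries one (gpa, len) pair and the 'gpa != 0'
sentinel is no longer needed.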
>> +     */
>> +    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
>> +	offset += 8;
>> +	offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);
>> +	if (!qemu_balloon_is_inhibited()) {
>> +	    free_mem_range(gpa, len);
>> +	}
>> +    }
>> +    virtqueue_push(vq, elem, offset);
>> +    virtio_notify(vdev, vq);
>> +    g_free(elem);
>> +}
>> +
>>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>  {
>>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
>> @@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>>      f |= dev->host_features;
>>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
>> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
>>  
>>      return f;
>>  }
>> @@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);
>>  
>>      if (virtio_has_feature(s->host_features,
>>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>> @@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
>>  
>>      object_property_add(obj, "guest-stats", "guest statistics",
>>                          balloon_stats_get_all, NULL, NULL, s, NULL);
>> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
>> +                        NULL, NULL, NULL, s, NULL);
>>  
>>      object_property_add(obj, "guest-stats-polling-interval", "int",
>>                          balloon_stats_get_poll_interval,
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index 1afafb12f6..a58b24fdf2 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {
>>  
>>  typedef struct VirtIOBalloon {
>>      VirtIODevice parent_obj;
>> -    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
>> +    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *hvq;
>>      uint32_t free_page_report_status;
>>      uint32_t num_pages;
>>      uint32_t actual;
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index af2b91f0b8..bb9207e7f4 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #else
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>>  #endif
>> +#ifdef MADV_FREE
>> +#define QEMU_MADV_FREE MADV_FREE
>> +#else
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>> +#endif
>>  
>>  #elif defined(CONFIG_POSIX_MADVISE)
>>  
>> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>  
>>  #else /* no-op */
>>  
>> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
>>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
>>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
>> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
>>  
>>  #endif
>>  
>> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
>> index 9375ca2a70..f9e3e82562 100644
>> --- a/include/standard-headers/linux/virtio_balloon.h
>> +++ b/include/standard-headers/linux/virtio_balloon.h
>> @@ -36,6 +36,7 @@
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
>> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>>  
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>> -- 
>> 2.21.0
-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [QEMU Patch] virtio-baloon: Support for page hinting
  2019-07-11 19:06     ` Nitesh Narayan Lal
@ 2019-07-11 22:36       ` Alexander Duyck
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Duyck @ 2019-07-11 22:36 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Michael S. Tsirkin, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	David Hildenbrand, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Thu, Jul 11, 2019 at 12:06 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 7/11/19 2:55 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 10, 2019 at 03:53:03PM -0400, Nitesh Narayan Lal wrote:
> >> Enables QEMU to perform madvise free on the memory range reported
> >> by the vm.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > Missing second "l" in the subject :)
> >
> >> ---
> >>  hw/virtio/trace-events                        |  1 +
> >>  hw/virtio/virtio-balloon.c                    | 59 +++++++++++++++++++
> >>  include/hw/virtio/virtio-balloon.h            |  2 +-
> >>  include/qemu/osdep.h                          |  7 +++
> >>  .../standard-headers/linux/virtio_balloon.h   |  1 +
> >>  5 files changed, 69 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> >> index e28ba48da6..f703a22d36 100644
> >> --- a/hw/virtio/trace-events
> >> +++ b/hw/virtio/trace-events
> >> @@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
> >>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
> >>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
> >>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
> >> +virtio_balloon_hinting_request(unsigned long pfn, unsigned int num_pages) "Guest page hinting request PFN:%lu size: %d"
> >>
> >>  # virtio-mmio.c
> >>  virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
> >> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> >> index 2112874055..5d186707b5 100644
> >> --- a/hw/virtio/virtio-balloon.c
> >> +++ b/hw/virtio/virtio-balloon.c
> >> @@ -34,6 +34,9 @@
> >>
> >>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >>
> >> +#define VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES       16
> >> +void free_mem_range(uint64_t addr, uint64_t len);
> >> +
> >>  struct PartiallyBalloonedPage {
> >>      RAMBlock *rb;
> >>      ram_addr_t base;
> >> @@ -328,6 +331,58 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
> >>      balloon_stats_change_timer(s, 0);
> >>  }
> >>
> >> +void free_mem_range(uint64_t addr, uint64_t len)
> >> +{
> >> +    int ret = 0;
> >> +    void *hvaddr_to_free;
> >> +    MemoryRegionSection mrs = memory_region_find(get_system_memory(),
> >> +                                                 addr, 1);
> >> +    if (!mrs.mr) {
> >> +    warn_report("%s:No memory is mapped at address 0x%lu", __func__, addr);
> >> +        return;
> >> +    }
> >> +
> >> +    if (!memory_region_is_ram(mrs.mr) && !memory_region_is_romd(mrs.mr)) {
> >> +    warn_report("%s:Memory at address 0x%s is not RAM:0x%lu", __func__,
> >> +                HWADDR_PRIx, addr);
> >> +        memory_region_unref(mrs.mr);
> >> +        return;
> >> +    }
> >> +
> >> +    hvaddr_to_free = qemu_map_ram_ptr(mrs.mr->ram_block, mrs.offset_within_region);
> >> +    trace_virtio_balloon_hinting_request(addr, len);
> >> +    ret = qemu_madvise(hvaddr_to_free,len, QEMU_MADV_FREE);
> >> +    if (ret == -1) {
> >> +    warn_report("%s: Madvise failed with error:%d", __func__, ret);
> >> +    }
> >> +}
> >> +
> >> +static void virtio_balloon_handle_page_hinting(VirtIODevice *vdev,
> >> +                                           VirtQueue *vq)
> >> +{
> >> +    VirtQueueElement *elem;
> >> +    size_t offset = 0;
> >> +    uint64_t gpa, len;
> >> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> >> +    if (!elem) {
> >> +        return;
> >> +    }
> >> +    /* For pending hints which are < max_pages(16), 'gpa != 0' ensures that we
> >> +     * only read the buffer which holds a valid PFN value.
> >> +     * TODO: Find a better way to do this.
> > Indeed. In fact, what is wrong with passing the gpa as
> > part of the element itself?
> There are two values which I need to read 'gpa' and 'len'. I will have
> to check how to pass them both as part of the element.
> But, I will look into it.

One advantage of doing it as a scatter-gather list being passed via
the element is that you only get one completion. If you are going to
do an element per page then you will need to somehow identify if the
entire ring has been processed or not before you free your local page
list.
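E.g. on the driver side the whole batch could go out as a single element,
roughly (untested; 'hints' stands in for an array of gpa/len pairs, and
nents, vq and vb come from the surrounding driver context):

	struct scatterlist sg[VIRTIO_BALLOON_PAGE_HINTING_MAX_PAGES];
	unsigned int i;

	/* one sg entry per hint, all submitted as a single element */
	sg_init_table(sg, nents);
	for (i = 0; i < nents; i++)
		sg_set_buf(&sg[i], &hints[i], sizeof(hints[i]));
	if (!virtqueue_add_outbuf(vq, sg, nents, vb, GFP_KERNEL))
		virtqueue_kick(vq);

and the host then completes the batch in one go.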

> >> +     */
> >> +    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &gpa, 8) == 8 && gpa != 0) {
> >> +    offset += 8;
> >> +    offset += iov_to_buf(elem->out_sg, elem->out_num, offset, &len, 8);
> >> +    if (!qemu_balloon_is_inhibited()) {
> >> +        free_mem_range(gpa, len);
> >> +    }
> >> +    }
> >> +    virtqueue_push(vq, elem, offset);
> >> +    virtio_notify(vdev, vq);
> >> +    g_free(elem);
> >> +}
> >> +
> >>  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >>  {
> >>      VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> >> @@ -694,6 +749,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
> >>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> >>      f |= dev->host_features;
> >>      virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
> >> +    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
> >>
> >>      return f;
> >>  }
> >> @@ -780,6 +836,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> >>      s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> >>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> >>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> >> +    s->hvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_page_hinting);
> >>
> >>      if (virtio_has_feature(s->host_features,
> >>                             VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> >> @@ -875,6 +932,8 @@ static void virtio_balloon_instance_init(Object *obj)
> >>
> >>      object_property_add(obj, "guest-stats", "guest statistics",
> >>                          balloon_stats_get_all, NULL, NULL, s, NULL);
> >> +    object_property_add(obj, "guest-page-hinting", "guest page hinting",
> >> +                        NULL, NULL, NULL, s, NULL);
> >>
> >>      object_property_add(obj, "guest-stats-polling-interval", "int",
> >>                          balloon_stats_get_poll_interval,
> >> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> >> index 1afafb12f6..a58b24fdf2 100644
> >> --- a/include/hw/virtio/virtio-balloon.h
> >> +++ b/include/hw/virtio/virtio-balloon.h
> >> @@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {
> >>
> >>  typedef struct VirtIOBalloon {
> >>      VirtIODevice parent_obj;
> >> -    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
> >> +    VirtQueue *ivq, *dvq, *svq, *free_page_vq, *hvq;
> >>      uint32_t free_page_report_status;
> >>      uint32_t num_pages;
> >>      uint32_t actual;
> >> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >> index af2b91f0b8..bb9207e7f4 100644
> >> --- a/include/qemu/osdep.h
> >> +++ b/include/qemu/osdep.h
> >> @@ -360,6 +360,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> >>  #else
> >>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> >>  #endif
> >> +#ifdef MADV_FREE
> >> +#define QEMU_MADV_FREE MADV_FREE
> >> +#else
> >> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> >> +#endif
> >>
> >>  #elif defined(CONFIG_POSIX_MADVISE)
> >>
> >> @@ -373,6 +378,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> >>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
> >>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
> >>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> >> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> >>
> >>  #else /* no-op */
> >>
> >> @@ -386,6 +392,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> >>  #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
> >>  #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
> >>  #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
> >> +#define QEMU_MADV_FREE QEMU_MADV_INVALID
> >>
> >>  #endif
> >>
> >> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> >> index 9375ca2a70..f9e3e82562 100644
> >> --- a/include/standard-headers/linux/virtio_balloon.h
> >> +++ b/include/standard-headers/linux/virtio_balloon.h
> >> @@ -36,6 +36,7 @@
> >>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM     2 /* Deflate balloon on OOM */
> >>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT     3 /* VQ to report free pages */
> >>  #define VIRTIO_BALLOON_F_PAGE_POISON        4 /* Guest is using page poisoning */
> >> +#define VIRTIO_BALLOON_F_HINTING    5 /* Page hinting virtqueue */
> >>
> >>  /* Size of a PFN in the balloon interface. */
> >>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> >> --
> >> 2.21.0
> --
> Thanks
> Nitesh
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 17:58     ` Nitesh Narayan Lal
@ 2019-07-11 23:20       ` Alexander Duyck
  2019-07-12  1:12         ` Nitesh Narayan Lal
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Duyck @ 2019-07-11 23:20 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 7/10/19 5:56 PM, Alexander Duyck wrote:
> > On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch introduces the core infrastructure for free page hinting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_FREE), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can be hinted to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> There are still various things to look into (e.g., memory hotplug, more
> >> efficient locking, possible races when disabling).
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

So just FYI, I thought I would try the patches. It looks like there
might be a bug somewhere that is causing it to free memory it
shouldn't. After about 10 minutes my VM crashed with a system log
full of various NULL pointer dereferences. The only change I had made
is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers
didn't have MADV_FREE on the host. It occurs to me that one advantage
of MADV_DONTNEED over MADV_FREE is that you are more likely to catch
these sorts of errors since it zeros the pages instead of leaving them
intact.

> >> ---
> >>  include/linux/page_hinting.h |  45 +++++++
> >>  mm/Kconfig                   |   6 +
> >>  mm/Makefile                  |   1 +
> >>  mm/page_alloc.c              |  18 +--
> >>  mm/page_hinting.c            | 250 +++++++++++++++++++++++++++++++++++
> >>  5 files changed, 312 insertions(+), 8 deletions(-)
> >>  create mode 100644 include/linux/page_hinting.h
> >>  create mode 100644 mm/page_hinting.c
> >>
> >> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
> >> new file mode 100644
> >> index 000000000000..4900feb796f9
> >> --- /dev/null
> >> +++ b/include/linux/page_hinting.h
> >> @@ -0,0 +1,45 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#ifndef _LINUX_PAGE_HINTING_H
> >> +#define _LINUX_PAGE_HINTING_H
> >> +
> >> +/*
> >> + * Minimum page order required for a page to be hinted to the host.
> >> + */
> >> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
> >> +
> > Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed
> > out earlier, or is it due to something else? I'm just wondering if
> > this will have an impact on architectures outside of x86 as I had
> > chosen pageblock_order, which happened to be MAX_ORDER - 2 on x86, but I
> > don't know what the impact of doing that is on other architectures
> > versus the (MAX_ORDER - 2) approach you took here.
> If I am not wrong, then any order < (MAX_ORDER - 2) will break THP.
> That's one reason we decided to stick with this.

That is true for x86, but I don't think that is true for other
architectures. That is why I went with pageblock_order instead of just
using a fixed value such as MAX_ORDER - 2.
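i.e. something along the lines of:

#define PAGE_HINTING_MIN_ORDER	pageblock_order

which works out to MAX_ORDER - 2 on x86 but follows the architecture's
huge page size elsewhere.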

<snip>

> >> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
> >> new file mode 100644
> >> index 000000000000..0bfa09f8c3ed
> >> --- /dev/null
> >> +++ b/mm/page_hinting.c
> >> @@ -0,0 +1,250 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Page hinting core infrastructure to enable a VM to report free pages to its
> >> + * hypervisor.
> >> + *
> >> + * Copyright Red Hat, Inc. 2019
> >> + *
> >> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> >> + */
> >> +
> >> +#include <linux/mm.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/page_hinting.h>
> >> +#include <linux/kvm_host.h>
> >> +
> >> +/*
> >> + * struct zone_free_area: For a single zone across NUMA nodes, it holds the
> >> + * bitmap pointer to track the free pages and other required parameters
> >> + * used to recover these pages by scanning the bitmap.
> >> + * @bitmap:            Pointer to the bitmap in PAGE_HINTING_MIN_ORDER
> >> + *                     granularity.
> >> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
> >> + * @end_pfn:           Indicates the last PFN value for the zone.
> >> + * @free_pages:                Tracks the number of free pages of granularity
> >> + *                     PAGE_HINTING_MIN_ORDER.
> >> + * @nbits:             Indicates the total size of the bitmap in bits allocated
> >> + *                     at the time of initialization.
> >> + */
> >> +struct zone_free_area {
> >> +       unsigned long *bitmap;
> >> +       unsigned long base_pfn;
> >> +       unsigned long end_pfn;
> >> +       atomic_t free_pages;
> >> +       unsigned long nbits;
> >> +} free_area[MAX_NR_ZONES];
> >> +
> > You still haven't addressed the NUMA issue I pointed out with v10. You
> > are only able to address the first set of zones with this setup. As
> > such you can end up missing large sections of memory if it is split
> > over multiple nodes.
> I think I did.

I just realized what you did. Actually this doesn't really improve
things in my opinion. More comments below.

> >
> >> +static void init_hinting_wq(struct work_struct *work);
> >> +static DEFINE_MUTEX(page_hinting_init);
> >> +const struct page_hinting_config *page_hitning_conf;
> >> +struct work_struct hinting_work;
> >> +atomic_t page_hinting_active;
> >> +
> >> +void free_area_cleanup(int nr_zones)
> >> +{
> > I'm not sure why you are passing nr_zones as an argument here. Won't
> > this always be MAX_NR_ZONES?
> free_area_cleanup() gets called from page_hinting_disable() and
> page_hinting_enable(). In page_hinting_enable() when the allocation
> fails we may not have to perform cleanup for all the zones every time.

Just adding a NULL pointer check to this loop below would still keep
it pretty cheap as the cost for initializing memory to 0 isn't that
high, and this is a slow path anyway. Either way I guess it works. You
might want to reset the bitmap pointer to NULL though after you free
it to more easily catch the double free case.
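That is, in free_area_cleanup():

	bitmap_free(free_area[zone_idx].bitmap);
	free_area[zone_idx].bitmap = NULL;	/* catch any double free */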

> >> +       int zone_idx;
> >> +
> >> +       for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) {
> >> +               bitmap_free(free_area[zone_idx].bitmap);
> >> +               free_area[zone_idx].base_pfn = 0;
> >> +               free_area[zone_idx].end_pfn = 0;
> >> +               free_area[zone_idx].nbits = 0;
> >> +               atomic_set(&free_area[zone_idx].free_pages, 0);
> >> +       }
> >> +}
> >> +
> >> +int page_hinting_enable(const struct page_hinting_config *conf)
> >> +{
> >> +       unsigned long bitmap_size = 0;
> >> +       int zone_idx = 0, ret = -EBUSY;
> >> +       struct zone *zone;
> >> +
> >> +       mutex_lock(&page_hinting_init);
> >> +       if (!page_hitning_conf) {
> >> +               for_each_populated_zone(zone) {
> > So for_each_populated_zone will go through all of the NUMA nodes. So
> > if I am not mistaken you will overwrite the free_area values of all
> > the previous nodes with the last node in the system.
> Not sure if I understood.

I misread the code. More comments below.

> >  So if we have a
> > setup that has all the memory in the first node, and none in the
> > second it would effectively disable free page hinting would it not?
> Why will it happen? The base_pfn will still be pointing to the base_pfn
> of the first node. Isn't it?

So this does address my concern; however, it introduces a new issue.
Specifically you could end up introducing a gap of unused bits if the
memory from one zone is not immediately adjacent to another. This gets
back to the SPARSEMEM issue that I think Dave pointed out.


<snip>

> >> +static void scan_zone_free_area(int zone_idx, int free_pages)
> >> +{
> >> +       int ret = 0, order, isolated_cnt = 0;
> >> +       unsigned long set_bit, start = 0;
> >> +       LIST_HEAD(isolated_pages);
> >> +       struct page *page;
> >> +       struct zone *zone;
> >> +
> >> +       for (;;) {
> >> +               ret = 0;
> >> +               set_bit = find_next_bit(free_area[zone_idx].bitmap,
> >> +                                       free_area[zone_idx].nbits, start);
> >> +               if (set_bit >= free_area[zone_idx].nbits)
> >> +                       break;
> >> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
> >> +                               free_area[zone_idx].base_pfn);
> >> +               if (!page)
> >> +                       continue;
> >> +               zone = page_zone(page);
> >> +               spin_lock(&zone->lock);
> >> +
> >> +               if (PageBuddy(page) && page_private(page) >=
> >> +                   PAGE_HINTING_MIN_ORDER) {
> >> +                       order = page_private(page);
> >> +                       ret = __isolate_free_page(page, order);
> >> +               }
> >> +               clear_bit(set_bit, free_area[zone_idx].bitmap);
> >> +               atomic_dec(&free_area[zone_idx].free_pages);
> >> +               spin_unlock(&zone->lock);
> >> +               if (ret) {
> >> +                       /*
> >> +                        * restoring page order to use it while releasing
> >> +                        * the pages back to the buddy.
> >> +                        */
> >> +                       set_page_private(page, order);
> >> +                       list_add_tail(&page->lru, &isolated_pages);
> >> +                       isolated_cnt++;
> >> +                       if (isolated_cnt == page_hitning_conf->max_pages) {
> >> +                               page_hitning_conf->hint_pages(&isolated_pages);
> >> +                               release_buddy_pages(&isolated_pages);
> >> +                               isolated_cnt = 0;
> >> +                       }
> >> +               }
> >> +               start = set_bit + 1;
> >> +       }
> >> +       if (isolated_cnt) {
> >> +               page_hitning_conf->hint_pages(&isolated_pages);
> >> +               release_buddy_pages(&isolated_pages);
> >> +       }
> >> +}
> >> +
> > I really worry that this loop is going to become more expensive as the
> > size of memory increases. For example if we hint on just 16 pages we
> > would have to walk something like 32K bits, 512 longs, if a system had
> > 64G of memory. Have you considered testing with a larger memory
> > footprint to see if it has an impact on performance?
> I am hoping this will be noticeable in will-it-scale's page_fault1 if I
> run it on a larger system.

What you will probably see is that the CPU that is running the scan is
going to be sitting at somewhere near 100% because I cannot see how it
can hope to stay efficient if it has to check something like 512 64b
longs searching for just a handful of idle pages.

> >
> >> +static void init_hinting_wq(struct work_struct *work)
> >> +{
> >> +       int zone_idx, free_pages;
> >> +
> >> +       atomic_set(&page_hinting_active, 1);
> >> +       for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
> >> +               free_pages = atomic_read(&free_area[zone_idx].free_pages);
> >> +               if (free_pages >= page_hitning_conf->max_pages)
> >> +                       scan_zone_free_area(zone_idx, free_pages);
> >> +       }
> >> +       atomic_set(&page_hinting_active, 0);
> >> +}
> >> +
> >> +void page_hinting_enqueue(struct page *page, int order)
> >> +{
> >> +       int zone_idx;
> >> +
> >> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
> >> +               return;
> > I would think it is going to be expensive to be jumping into this
> > function for every freed page. You should probably have an inline
> > taking care of the order check before you even get here since it would
> > be faster that way.
> I see, I can take a look. Thanks.
> >
> >> +
> >> +       bm_set_pfn(page);
> >> +       if (atomic_read(&page_hinting_active))
> >> +               return;
> > So I would think this piece is racy. Specifically, if you set a bit
> > for a PFN that is somewhere below the PFN you are currently processing
> > in your scan, it is going to sit there unprocessed until another page
> > is freed after the scan is completed. I would worry you can end up with
> > a batch free of memory resulting in a group of pages sitting at the
> > start of your bitmap unhinted.
> True, but those pages will be hinted the next time the threshold is met.

Yes, but that assumes that there is another free immediately coming.
It is possible that you have a big application run and then
immediately shut down, freeing all of its memory at once. The
worst-case scenario would be that it starts by freeing from the end and
works toward the start. With that you could theoretically end up with
a significant chunk of memory waiting some time for another big free
to come along.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 23:20       ` Alexander Duyck
@ 2019-07-12  1:12         ` Nitesh Narayan Lal
  2019-07-12 16:22           ` Alexander Duyck
  0 siblings, 1 reply; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-12  1:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/11/19 7:20 PM, Alexander Duyck wrote:
> On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 7/10/19 5:56 PM, Alexander Duyck wrote:
>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> This patch introduces the core infrastructure for free page hinting in
>>>> virtual environments. It enables the kernel to track the free pages which
>>>> can be reported to its hypervisor so that the hypervisor could
>>>> free and reuse that memory as per its requirement.
>>>>
>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>> would be possible. To avoid such a situation, these pages are
>>>> temporarily removed from the buddy. The amount of pages removed
>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>> in our case).
>>>>
>>>> To efficiently identify free pages that can be hinted to the
>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>
>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>
>>>> There are still various things to look into (e.g., memory hotplug, more
>>>> efficient locking, possible races when disabling).
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> So just FYI, I thought I would try the patches. It looks like there
> might be a bug somewhere that is causing it to free memory it
> shouldn't. After about 10 minutes my VM crashed with a system log
> full of various NULL pointer dereferences.

That's interesting; I have tried the patches with MADV_DONTNEED as well.
I just retried it but didn't see any crash. May I know what kind of
workload you are running?

>  The only change I had made
> is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers
> didn't have MADV_FREE on the host. It occurs to me that one advantage
> of MADV_DONTNEED over MADV_FREE is that you are more likely to catch
> these sorts of errors since it zeros the pages instead of leaving them
> intact.
For development purposes, maybe. For the final patch-set, I think we
discussed earlier why we should keep MADV_FREE.
>
>>>> ---
>>>>  include/linux/page_hinting.h |  45 +++++++
>>>>  mm/Kconfig                   |   6 +
>>>>  mm/Makefile                  |   1 +
>>>>  mm/page_alloc.c              |  18 +--
>>>>  mm/page_hinting.c            | 250 +++++++++++++++++++++++++++++++++++
>>>>  5 files changed, 312 insertions(+), 8 deletions(-)
>>>>  create mode 100644 include/linux/page_hinting.h
>>>>  create mode 100644 mm/page_hinting.c
>>>>
>>>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>>>> new file mode 100644
>>>> index 000000000000..4900feb796f9
>>>> --- /dev/null
>>>> +++ b/include/linux/page_hinting.h
>>>> @@ -0,0 +1,45 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef _LINUX_PAGE_HINTING_H
>>>> +#define _LINUX_PAGE_HINTING_H
>>>> +
>>>> +/*
>>>> + * Minimum page order required for a page to be hinted to the host.
>>>> + */
>>>> +#define PAGE_HINTING_MIN_ORDER         (MAX_ORDER - 2)
>>>> +
>>> Why use (MAX_ORDER - 2)? Is this just because of the issues I pointed
>>> out earlier, or is it due to something else? I'm just wondering if
>>> this will have an impact on architectures outside of x86 as I had
>>> chosen pageblock_order, which happened to be MAX_ORDER - 2 on x86, but I
>>> don't know what the impact of doing that is on other architectures
>>> versus the (MAX_ORDER - 2) approach you took here.
>> If I am not wrong, then any order < (MAX_ORDER - 2) will break THP.
>> That's one reason we decided to stick with this.
> That is true for x86, but I don't think that is true for other
> architectures. That is why I went with pageblock_order instead of just
> using a fixed value such as MAX_ORDER - 2.
I see, I will have to check this.
>
> <snip>
>
>>>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>>>> new file mode 100644
>>>> index 000000000000..0bfa09f8c3ed
>>>> --- /dev/null
>>>> +++ b/mm/page_hinting.c
>>>> @@ -0,0 +1,250 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Page hinting core infrastructure to enable a VM to report free pages to its
>>>> + * hypervisor.
>>>> + *
>>>> + * Copyright Red Hat, Inc. 2019
>>>> + *
>>>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>>>> + */
>>>> +
>>>> +#include <linux/mm.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/page_hinting.h>
>>>> +#include <linux/kvm_host.h>
>>>> +
>>>> +/*
>>>> + * struct zone_free_area: For a single zone across NUMA nodes, it holds the
>>>> + * bitmap pointer to track the free pages and other required parameters
>>>> + * used to recover these pages by scanning the bitmap.
>>>> + * @bitmap:            Pointer to the bitmap in PAGE_HINTING_MIN_ORDER
>>>> + *                     granularity.
>>>> + * @base_pfn:          Starting PFN value for the zone whose bitmap is stored.
>>>> + * @end_pfn:           Indicates the last PFN value for the zone.
>>>> + * @free_pages:                Tracks the number of free pages of granularity
>>>> + *                     PAGE_HINTING_MIN_ORDER.
>>>> + * @nbits:             Indicates the total size of the bitmap in bits allocated
>>>> + *                     at the time of initialization.
>>>> + */
>>>> +struct zone_free_area {
>>>> +       unsigned long *bitmap;
>>>> +       unsigned long base_pfn;
>>>> +       unsigned long end_pfn;
>>>> +       atomic_t free_pages;
>>>> +       unsigned long nbits;
>>>> +} free_area[MAX_NR_ZONES];
>>>> +
>>> You still haven't addressed the NUMA issue I pointed out with v10. You
>>> are only able to address the first set of zones with this setup. As
>>> such you can end up missing large sections of memory if it is split
>>> over multiple nodes.
>> I think I did.
> I just realized what you did. Actually this doesn't really improve
> things in my opinion. More comments below.
>
>>>> +static void init_hinting_wq(struct work_struct *work);
>>>> +static DEFINE_MUTEX(page_hinting_init);
>>>> +const struct page_hinting_config *page_hitning_conf;
>>>> +struct work_struct hinting_work;
>>>> +atomic_t page_hinting_active;
>>>> +
>>>> +void free_area_cleanup(int nr_zones)
>>>> +{
>>> I'm not sure why you are passing nr_zones as an argument here. Won't
>>> this always be MAX_NR_ZONES?
>> free_area_cleanup() gets called from page_hinting_disable() and
>> page_hinting_enable(). In page_hinting_enable() when the allocation
>> fails we may not have to perform cleanup for all the zones every time.
> Just adding a NULL pointer check to this loop below would still keep
> it pretty cheap as the cost for initializing memory to 0 isn't that
> high, and this is a slow path anyway. Either way I guess it works.
Yeah.
> You
> might want to reset the bitmap pointer to NULL though after you free
> it to more easily catch the double free case.
I think resetting the bitmap pointer to NULL is a good idea. Thanks.
>
>>>> +       int zone_idx;
>>>> +
>>>> +       for (zone_idx = 0; zone_idx < nr_zones; zone_idx++) {
>>>> +               bitmap_free(free_area[zone_idx].bitmap);
>>>> +               free_area[zone_idx].base_pfn = 0;
>>>> +               free_area[zone_idx].end_pfn = 0;
>>>> +               free_area[zone_idx].nbits = 0;
>>>> +               atomic_set(&free_area[zone_idx].free_pages, 0);
>>>> +       }
>>>> +}
>>>> +
>>>> +int page_hinting_enable(const struct page_hinting_config *conf)
>>>> +{
>>>> +       unsigned long bitmap_size = 0;
>>>> +       int zone_idx = 0, ret = -EBUSY;
>>>> +       struct zone *zone;
>>>> +
>>>> +       mutex_lock(&page_hinting_init);
>>>> +       if (!page_hitning_conf) {
>>>> +               for_each_populated_zone(zone) {
>>> So for_each_populated_zone will go through all of the NUMA nodes. So
>>> if I am not mistaken you will overwrite the free_area values of all
>>> the previous nodes with the last node in the system.
>> Not sure if I understood.
> I misread the code. More comments below.
>
>>>  So if we have a
>>> setup that has all the memory in the first node, and none in the
>>> second it would effectively disable free page hinting would it not?
>> Why will it happen? The base_pfn will still be pointing to the base_pfn
>> of the first node. Isn't it?
> So this does address my concern; however, it introduces a new issue.
> Specifically you could end up introducing a gap of unused bits if the
> memory from one zone is not immediately adjacent to another. This gets
> back to the SPARSEMEM issue that I think Dave pointed out.
Yeah, he did point it out. It looks like a valid issue; I will look into it.
>
>
> <snip>
>
>>>> +static void scan_zone_free_area(int zone_idx, int free_pages)
>>>> +{
>>>> +       int ret = 0, order, isolated_cnt = 0;
>>>> +       unsigned long set_bit, start = 0;
>>>> +       LIST_HEAD(isolated_pages);
>>>> +       struct page *page;
>>>> +       struct zone *zone;
>>>> +
>>>> +       for (;;) {
>>>> +               ret = 0;
>>>> +               set_bit = find_next_bit(free_area[zone_idx].bitmap,
>>>> +                                       free_area[zone_idx].nbits, start);
>>>> +               if (set_bit >= free_area[zone_idx].nbits)
>>>> +                       break;
>>>> +               page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>>>> +                               free_area[zone_idx].base_pfn);
>>>> +               if (!page)
>>>> +                       continue;
>>>> +               zone = page_zone(page);
>>>> +               spin_lock(&zone->lock);
>>>> +
>>>> +               if (PageBuddy(page) && page_private(page) >=
>>>> +                   PAGE_HINTING_MIN_ORDER) {
>>>> +                       order = page_private(page);
>>>> +                       ret = __isolate_free_page(page, order);
>>>> +               }
>>>> +               clear_bit(set_bit, free_area[zone_idx].bitmap);
>>>> +               atomic_dec(&free_area[zone_idx].free_pages);
>>>> +               spin_unlock(&zone->lock);
>>>> +               if (ret) {
>>>> +                       /*
>>>> +                        * restoring page order to use it while releasing
>>>> +                        * the pages back to the buddy.
>>>> +                        */
>>>> +                       set_page_private(page, order);
>>>> +                       list_add_tail(&page->lru, &isolated_pages);
>>>> +                       isolated_cnt++;
>>>> +                       if (isolated_cnt == page_hitning_conf->max_pages) {
>>>> +                               page_hitning_conf->hint_pages(&isolated_pages);
>>>> +                               release_buddy_pages(&isolated_pages);
>>>> +                               isolated_cnt = 0;
>>>> +                       }
>>>> +               }
>>>> +               start = set_bit + 1;
>>>> +       }
>>>> +       if (isolated_cnt) {
>>>> +               page_hitning_conf->hint_pages(&isolated_pages);
>>>> +               release_buddy_pages(&isolated_pages);
>>>> +       }
>>>> +}
>>>> +
>>> I really worry that this loop is going to become more expensive as the
>>> size of memory increases. For example if we hint on just 16 pages we
>>> would have to walk something like 32K bits, or 512 longs, if a system had
>>> 64G of memory. Have you considered testing with a larger memory
>>> footprint to see if it has an impact on performance?
>> I am hoping this will be noticeable in will-it-scale's page_fault1 if I
>> run it on a larger system.
> What you will probably see is that the CPU that is running the scan is
> going to be sitting at somewhere near 100% because I cannot see how it
> can hope to stay efficient if it has to check something like 512 64b
> longs searching for just a handful of idle pages.
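For reference, the arithmetic behind those numbers (assuming 2MB tracking
granularity, i.e. MAX_ORDER - 2 on x86, and 64-bit longs):

        64GB / 2MB per bit            = 32768 bits
        32768 bits / 64 bits per long =   512 longs (4KB of bitmap)

So a full scan touches 4KB of bitmap regardless of how few bits are set.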
>
>>>> +static void init_hinting_wq(struct work_struct *work)
>>>> +{
>>>> +       int zone_idx, free_pages;
>>>> +
>>>> +       atomic_set(&page_hinting_active, 1);
>>>> +       for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
>>>> +               free_pages = atomic_read(&free_area[zone_idx].free_pages);
>>>> +               if (free_pages >= page_hitning_conf->max_pages)
>>>> +                       scan_zone_free_area(zone_idx, free_pages);
>>>> +       }
>>>> +       atomic_set(&page_hinting_active, 0);
>>>> +}
>>>> +
>>>> +void page_hinting_enqueue(struct page *page, int order)
>>>> +{
>>>> +       int zone_idx;
>>>> +
>>>> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
>>>> +               return;
>>> I would think it is going to be expensive to be jumping into this
>>> function for every freed page. You should probably have an inline
>>> taking care of the order check before you even get here since it would
>>> be faster that way.
>> I see, I can take a look. Thanks.
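Something along these lines, maybe (untested sketch; __page_hinting_enqueue
is a made-up name for the out-of-line part, which would also keep the
page_hitning_conf check):

        /* include/linux/page_hinting.h */
        void __page_hinting_enqueue(struct page *page, int order);

        static inline void page_hinting_enqueue(struct page *page, int order)
        {
                /*
                 * Inlined into the page freeing path, so the common case
                 * (order too small) costs only a compare, not a call.
                 */
                if (order < PAGE_HINTING_MIN_ORDER)
                        return;
                __page_hinting_enqueue(page, order);
        }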
>>>> +
>>>> +       bm_set_pfn(page);
>>>> +       if (atomic_read(&page_hinting_active))
>>>> +               return;
>>> So I would think this piece is racy. Specifically if you set a PFN
>>> that is somewhere below the PFN you are currently processing in your
>>> scan it is going to remain unset until you have another page freed
>>> after the scan is completed. I would worry you can end up with a batch
>>> free of memory resulting in a group of pages sitting at the start of
>>> your bitmap unhinted.
>> True, but that will be hinted the next time the threshold is met.
> Yes, but that assumes that there is another free immediately coming.
> It is possible that you have a big application run and then
> immediately shut down and have it free all its memory at once. Worst
> case scenario would be that it starts by freeing from the end and
> works toward the start. With that you could theoretically end up with
> a significant chunk of memory waiting some time for another big free
> to come along.

Any suggestion on some benchmark/test application which I could run to
see this kind of behavior?

-- 
Thanks
Nitesh



* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-12  1:12         ` Nitesh Narayan Lal
@ 2019-07-12 16:22           ` Alexander Duyck
  2019-07-12 16:25             ` Nitesh Narayan Lal
  2019-08-08 11:41             ` Nitesh Narayan Lal
  0 siblings, 2 replies; 43+ messages in thread
From: Alexander Duyck @ 2019-07-12 16:22 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko

On Thu, Jul 11, 2019 at 6:13 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 7/11/19 7:20 PM, Alexander Duyck wrote:
> > On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 7/10/19 5:56 PM, Alexander Duyck wrote:
> >>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>> This patch introduces the core infrastructure for free page hinting in
> >>>> virtual environments. It enables the kernel to track the free pages which
> >>>> can be reported to its hypervisor so that the hypervisor could
> >>>> free and reuse that memory as per its requirement.
> >>>>
> >>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>> via MADV_FREE), the guest must not use them, otherwise, data loss
> >>>> would be possible. To avoid such a situation, these pages are
> >>>> temporarily removed from the buddy. The amount of pages removed
> >>>> temporarily from the buddy is governed by the backend (virtio-balloon
> >>>> in our case).
> >>>>
> >>>> To efficiently identify free pages that can be hinted to the
> >>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>
> >>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>
> >>>> There are still various things to look into (e.g., memory hotplug, more
> >>>> efficient locking, possible races when disabling).
> >>>>
> >>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > So just FYI, I thought I would try the patches. It looks like there
> > might be a bug somewhere that is causing it to free memory it
> > shouldn't be. After about 10 minutes my VM crashed with a system log
> > full of various NULL pointer dereferences.
>
> That's interesting, I have tried the patches with MADV_DONTNEED as well.
> I just retried it but didn't see any crash. May I know what kind of
> workload you are running?

I was running the page_fault1 test on a VM with 80G of memory.

> >  The only change I had made
> > is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers
> > didn't have MADV_FREE on the host. It occurs to me one advantage of
> > MADV_DONTNEED over MADV_FREE is that you are more likely to catch
> > these sort of errors since it zeros the pages instead of leaving them
> > intact.
> For development purposes, maybe. For the final patch-set I think we
> discussed earlier why we should keep MADV_FREE.

I'm still not convinced MADV_FREE is a net win, at least for
performance. You are still paying the cost for the VMEXIT in order to
regain ownership of the page. In the case that you are under memory
pressure it is essentially equivalent to MADV_DONTNEED. Also it
doesn't really do much to help with the memory footprint of the VM
itself. With the MADV_DONTNEED the pages are freed back and you have a
greater likelihood of reducing the overall memory footprint of the
entire system since you would be more likely to be assigned pages that
were recently used rather than having to access a cold page.
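
For illustration, the host-side difference boils down to which advice QEMU
gives on the hinted range (sketch only, not the actual QEMU code):

        /* host/QEMU side, for each hinted (hva, len) range */
        madvise(hva, len, MADV_DONTNEED); /* backing pages dropped right
                                           * away; the next guest touch
                                           * faults in a fresh zeroed page
                                           */
        madvise(hva, len, MADV_FREE);     /* pages reclaimed only under host
                                           * memory pressure; until then the
                                           * old contents may survive
                                           */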

<snip>

> >>>> +void page_hinting_enqueue(struct page *page, int order)
> >>>> +{
> >>>> +       int zone_idx;
> >>>> +
> >>>> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
> >>>> +               return;
> >>> I would think it is going to be expensive to be jumping into this
> >>> function for every freed page. You should probably have an inline
> >>> taking care of the order check before you even get here since it would
> >>> be faster that way.
> >> I see, I can take a look. Thanks.
> >>>> +
> >>>> +       bm_set_pfn(page);
> >>>> +       if (atomic_read(&page_hinting_active))
> >>>> +               return;
> >>> So I would think this piece is racy. Specifically if you set a PFN
> >>> that is somewhere below the PFN you are currently processing in your
> >>> scan it is going to remain unset until you have another page freed
> >>> after the scan is completed. I would worry you can end up with a batch
> >>> free of memory resulting in a group of pages sitting at the start of
> >>> your bitmap unhinted.
>> True, but that will be hinted the next time the threshold is met.
> > Yes, but that assumes that there is another free immediately coming.
> > It is possible that you have a big application run and then
> > immediately shut down and have it free all its memory at once. Worst
> > case scenario would be that it starts by freeing from the end and
> > works toward the start. With that you could theoretically end up with
> > a significant chunk of memory waiting some time for another big free
> > to come along.
>
> Any suggestion on some benchmark/test application which I could run to
> see this kind of behavior?

Like I mentioned before, try doing a VM with a bigger memory
footprint. You could probably just do a stack of VMs like what we were
doing with the memhog test. Basically the longer it takes to process
all the pages, the greater the likelihood that there are still pages
left when they are freed.


* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-12 16:22           ` Alexander Duyck
@ 2019-07-12 16:25             ` Nitesh Narayan Lal
  2019-08-08 11:41             ` Nitesh Narayan Lal
  1 sibling, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-12 16:25 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/12/19 12:22 PM, Alexander Duyck wrote:
> On Thu, Jul 11, 2019 at 6:13 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 7/11/19 7:20 PM, Alexander Duyck wrote:
>>> On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> On 7/10/19 5:56 PM, Alexander Duyck wrote:
>>>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>>>> This patch introduces the core infrastructure for free page hinting in
>>>>>> virtual environments. It enables the kernel to track the free pages which
>>>>>> can be reported to its hypervisor so that the hypervisor could
>>>>>> free and reuse that memory as per its requirement.
>>>>>>
>>>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>>>> would be possible. To avoid such a situation, these pages are
>>>>>> temporarily removed from the buddy. The amount of pages removed
>>>>>> temporarily from the buddy is governed by the backend (virtio-balloon
>>>>>> in our case).
>>>>>>
>>>>>> To efficiently identify free pages that can be hinted to the
>>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>>>
>>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>>>
>>>>>> There are still various things to look into (e.g., memory hotplug, more
>>>>>> efficient locking, possible races when disabling).
>>>>>>
>>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> So just FYI, I thought I would try the patches. It looks like there
>>> might be a bug somewhere that is causing it to free memory it
>>> shouldn't be. After about 10 minutes my VM crashed with a system log
>>> full of various NULL pointer dereferences.
>> That's interesting, I have tried the patches with MADV_DONTNEED as well.
>> I just retried it but didn't see any crash. May I know what kind of
>> workload you are running?
> I was running the page_fault1 test on a VM with 80G of memory.
>
>>>  The only change I had made
>>> is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers
>>> didn't have MADV_FREE on the host. It occurs to me one advantage of
>>> MADV_DONTNEED over MADV_FREE is that you are more likely to catch
>>> these sort of errors since it zeros the pages instead of leaving them
>>> intact.
>> For development purposes, maybe. For the final patch-set I think we
>> discussed earlier why we should keep MADV_FREE.
> I'm still not convinced MADV_FREE is a net win, at least for
> performance. You are still paying the cost for the VMEXIT in order to
> regain ownership of the page. In the case that you are under memory
> pressure it is essentially equivalent to MADV_DONTNEED. Also it
> doesn't really do much to help with the memory footprint of the VM
> itself. With the MADV_DONTNEED the pages are freed back and you have a
> greater likelihood of reducing the overall memory footprint of the
> entire system since you would be more likely to be assigned pages that
> were recently used rather than having to access a cold page.	
>
> <snip>
>
>>>>>> +void page_hinting_enqueue(struct page *page, int order)
>>>>>> +{
>>>>>> +       int zone_idx;
>>>>>> +
>>>>>> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
>>>>>> +               return;
>>>>> I would think it is going to be expensive to be jumping into this
>>>>> function for every freed page. You should probably have an inline
>>>>> taking care of the order check before you even get here since it would
>>>>> be faster that way.
>>>> I see, I can take a look. Thanks.
>>>>>> +
>>>>>> +       bm_set_pfn(page);
>>>>>> +       if (atomic_read(&page_hinting_active))
>>>>>> +               return;
>>>>> So I would think this piece is racy. Specifically if you set a PFN
>>>>> that is somewhere below the PFN you are currently processing in your
>>>>> scan it is going to remain unset until you have another page freed
>>>>> after the scan is completed. I would worry you can end up with a batch
>>>>> free of memory resulting in a group of pages sitting at the start of
>>>>> your bitmap unhinted.
>>>> True, but that will be hinted the next time the threshold is met.
>>> Yes, but that assumes that there is another free immediately coming.
>>> It is possible that you have a big application run and then
>>> immediately shut down and have it free all its memory at once. Worst
>>> case scenario would be that it starts by freeing from the end and
>>> works toward the start. With that you could theoretically end up with
>>> a significant chunk of memory waiting some time for another big free
>>> to come along.
>> Any suggestion on some benchmark/test application which I could run to
>> see this kind of behavior?
> Like I mentioned before, try doing a VM with a bigger memory
> footprint. You could probably just do a stack of VMs like what we were
> doing with the memhog test. Basically the longer it takes to process
> all the pages, the greater the likelihood that there are still pages
> left when they are freed.
Thanks. Before the next posting I will make sure to test with a larger VM
(>64GB).
-- 
Thanks
Nitesh


* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-10 20:45   ` Dave Hansen
  2019-07-11 11:48     ` Nitesh Narayan Lal
  2019-07-11 15:25     ` Nitesh Narayan Lal
@ 2019-07-15  9:26     ` David Hildenbrand
  2 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-07-15  9:26 UTC (permalink / raw)
  To: Dave Hansen, Nitesh Narayan Lal, kvm, linux-kernel, linux-mm,
	pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 10.07.19 22:45, Dave Hansen wrote:
> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>> +struct zone_free_area {
>> +	unsigned long *bitmap;
>> +	unsigned long base_pfn;
>> +	unsigned long end_pfn;
>> +	atomic_t free_pages;
>> +	unsigned long nbits;
>> +} free_area[MAX_NR_ZONES];
> 
> Why do we need an extra data structure?  What's wrong with putting
> per-zone data in ... 'struct zone'?  The cover letter claims that it
> doesn't touch core-mm infrastructure, but if it depends on mechanisms
> like this, I think that's a very bad thing.
> 
> To be honest, I'm not sure this series is worth reviewing at this point.
>  It's horribly lightly commented and full of kernel antipatterns like
> 
> void func()
> {
> 	if () {
> 		... indent entire logic
> 		... of function
> 	}
> }

"full of". Hmm.

> 
> It has big "TODO"s.  It's virtually comment-free.  I'm shocked it's at
> the 11th version and still looking like this.
> 
>> +
>> +		for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
>> +			unsigned long pages = free_area[zone_idx].end_pfn -
>> +					free_area[zone_idx].base_pfn;
>> +			bitmap_size = (pages >> PAGE_HINTING_MIN_ORDER) + 1;
>> +			if (!bitmap_size)
>> +				continue;
>> +			free_area[zone_idx].bitmap = bitmap_zalloc(bitmap_size,
>> +								   GFP_KERNEL);
> 
> This doesn't support sparse zones.  We can have zones with massive
> spanned page sizes, but very few present pages.  On those zones, this
> will exhaust memory for no good reason.

Yes, AFAIKS, sparse zones are problematic when we have NORMAL/MOVABLE mixed.

1 bit for 2MB, 1 byte for 16MB, 64 bytes for 1GB

IOW, this isn't optimal but only really problematic for big systems /
very huge sparse zones.
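
To put rough numbers on it (1 bit per 2MB chunk):

        1GB hole -> 1GB / 2MB =    512 bits = 64 bytes of wasted bitmap
        1TB hole -> 1TB / 2MB = 524288 bits = 64KB of wasted bitmap

so the waste only really adds up with very large holes.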

> 
> Comparing this to Alex's patch set, it's of much lower quality and at a
> much earlier stage of development.  The two sets are not really even
> comparable right now.  This certainly doesn't sell me on (or even really

To be honest, I find this statement quite harsh. Nitesh's hard work in
the previous RFCs and many discussions with Alex essentially resulted
in the two approaches we have right now. Alex's approach would not look
the way it looks today without Nitesh's RFCs.

So much to that.

> enumerate the deltas in) this approach vs. Alex's.

I am aware that memory hotplug is not properly supported yet (future
work). Sparse zones work but eventually waste a handful of pages (!) -
future work. Anything else you are aware of that is missing?

My opinion:

1. Alex's solution is clearly beneficial, as we don't need to manage/scan
a bitmap. *However*, we were concerned right from the beginning whether
core-buddy modifications would be accepted upstream for a purely
virtualization-specific (as of now!) feature. If we can get it upstream,
perfect. Back when we discussed the idea with Alex I was skeptical - I
was expecting way more core modifications.

2. We were looking for an alternative solution that doesn't require
modifying the buddy. We have that now - yes, some things have to be worked
out and cleaned up, not arguing against that. A cleaned-up version of
this RFC with some fixes and enhancements should be ready to be used in
*many* (not all) setups. Which is perfectly fine.

So in summary, I think we should try our best to get Alex's series into
shape and accepted upstream. However, if we get upstream resistance or
it takes ages to get it in, I think we can start with this series
here (which requires no major buddy modifications as of now) and then
slowly see if we can convert it into Alex's approach.

The important part for me is that the core<->driver interface and the
virtio interface is in a clean shape, so we can essentially swap out the
implementation specific parts in the core.

Cheers.

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-11 18:21   ` Dave Hansen
@ 2019-07-15  9:33     ` David Hildenbrand
  2019-07-15 14:40       ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-07-15  9:33 UTC (permalink / raw)
  To: Dave Hansen, Nitesh Narayan Lal, kvm, linux-kernel, linux-mm,
	pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 11.07.19 20:21, Dave Hansen wrote:
> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>> +static void bm_set_pfn(struct page *page)
>> +{
>> +	struct zone *zone = page_zone(page);
>> +	int zone_idx = page_zonenum(page);
>> +	unsigned long bitnr = 0;
>> +
>> +	lockdep_assert_held(&zone->lock);
>> +	bitnr = pfn_to_bit(page, zone_idx);
>> +	/*
>> +	 * TODO: fix possible underflows.
>> +	 */
>> +	if (free_area[zone_idx].bitmap &&
>> +	    bitnr < free_area[zone_idx].nbits &&
>> +	    !test_and_set_bit(bitnr, free_area[zone_idx].bitmap))
>> +		atomic_inc(&free_area[zone_idx].free_pages);
>> +}
> 
> Let's say I have two NUMA nodes, each with ZONE_NORMAL and ZONE_MOVABLE
> and each zone with 1GB of memory:
> 
> Node:         0        1
> NORMAL   0->1GB   2->3GB
> MOVABLE  1->2GB   3->4GB
> 
> This code will allocate two bitmaps.  The ZONE_NORMAL bitmap will
> represent data from 0->3GB and the ZONE_MOVABLE bitmap will represent
> data from 1->4GB.  That's the result of this code:
> 
>> +			if (free_area[zone_idx].base_pfn) {
>> +				free_area[zone_idx].base_pfn =
>> +					min(free_area[zone_idx].base_pfn,
>> +					    zone->zone_start_pfn);
>> +				free_area[zone_idx].end_pfn =
>> +					max(free_area[zone_idx].end_pfn,
>> +					    zone->zone_start_pfn +
>> +					    zone->spanned_pages);
> 
> But that means that both bitmaps will have space for PFNs in the other
> zone type, which is completely bogus.  This is fundamental because the
> data structures are incorrectly built per zone *type* instead of per zone.
> 

I don't think it's incorrect, it's just not optimal in all scenarios.
E.g., in your example, this approach would "waste" tracking data for
the 2 * 1GB holes (2 * 64 bytes when using 1 bit for 2MB).

FWIW, this is not a NUMA-specific thingy. We can have sparse zones
easily on single-NUMA systems.

Node:                 0
NORMAL   0->1GB, 2->3GB
MOVABLE  1->2GB, 3->4GB

So tracking it per zone instead of per zone type is only one part
of the story.
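
(If we were to track it per zone, the data would presumably just live in
struct zone - hypothetical field names, only to sketch the shape:

        struct zone {
                ...
                /* free page hinting state, one instance per zone */
                unsigned long   *hint_bitmap;
                unsigned long   hint_base_pfn;
                unsigned long   hint_nbits;
                atomic_t        hint_free_pages;
                ...
        };

which would also do away with the min/max span merging across nodes - but
it still would not help with holes inside a single zone.)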

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-15  9:33     ` David Hildenbrand
@ 2019-07-15 14:40       ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-07-15 14:40 UTC (permalink / raw)
  To: Dave Hansen, Nitesh Narayan Lal, kvm, linux-kernel, linux-mm,
	pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel,
	mst, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck,
	john.starks, mhocko

On 15.07.19 11:33, David Hildenbrand wrote:
> On 11.07.19 20:21, Dave Hansen wrote:
>> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>>> +static void bm_set_pfn(struct page *page)
>>> +{
>>> +	struct zone *zone = page_zone(page);
>>> +	int zone_idx = page_zonenum(page);
>>> +	unsigned long bitnr = 0;
>>> +
>>> +	lockdep_assert_held(&zone->lock);
>>> +	bitnr = pfn_to_bit(page, zone_idx);
>>> +	/*
>>> +	 * TODO: fix possible underflows.
>>> +	 */
>>> +	if (free_area[zone_idx].bitmap &&
>>> +	    bitnr < free_area[zone_idx].nbits &&
>>> +	    !test_and_set_bit(bitnr, free_area[zone_idx].bitmap))
>>> +		atomic_inc(&free_area[zone_idx].free_pages);
>>> +}
>>
>> Let's say I have two NUMA nodes, each with ZONE_NORMAL and ZONE_MOVABLE
>> and each zone with 1GB of memory:
>>
>> Node:         0        1
>> NORMAL   0->1GB   2->3GB
>> MOVABLE  1->2GB   3->4GB
>>
>> This code will allocate two bitmaps.  The ZONE_NORMAL bitmap will
>> represent data from 0->3GB and the ZONE_MOVABLE bitmap will represent
>> data from 1->4GB.  That's the result of this code:
>>
>>> +			if (free_area[zone_idx].base_pfn) {
>>> +				free_area[zone_idx].base_pfn =
>>> +					min(free_area[zone_idx].base_pfn,
>>> +					    zone->zone_start_pfn);
>>> +				free_area[zone_idx].end_pfn =
>>> +					max(free_area[zone_idx].end_pfn,
>>> +					    zone->zone_start_pfn +
>>> +					    zone->spanned_pages);
>>
>> But that means that both bitmaps will have space for PFNs in the other
>> zone type, which is completely bogus.  This is fundamental because the
>> data structures are incorrectly built per zone *type* instead of per zone.
>>
> 
> I don't think it's incorrect, it's just not optimal in all scenarios.
> E.g., in your example, this approach would "waste" tracking data for
> the 2 * 1GB holes (2 * 64 bytes when using 1 bit for 2MB).
> 
> FWIW, this is not a NUMA-specific thingy. We can have sparse zones
> easily on single-NUMA systems.
> 
> Node:                 0
> NORMAL   0->1GB, 2->3GB
> MOVABLE  1->2GB, 3->4GB
> 
> So tracking it per zone instead of per zone type is only one part
> of the story.
> 

Oh, and FWIW,

in setups like

Node:                 0               1
NORMAL   4->5GB, 6->7GB  5->6GB, 8->9GB

What Nitesh proposes is actually better. So it really depends on the use
case - but in general sparsity is the issue.

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-07-10 19:51 ` [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
@ 2019-07-24 19:47   ` Michael S. Tsirkin
  2019-07-24 19:56     ` David Hildenbrand
  2019-07-24 20:06     ` Nitesh Narayan Lal
  0 siblings, 2 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2019-07-24 19:47 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko

On Wed, Jul 10, 2019 at 03:51:58PM -0400, Nitesh Narayan Lal wrote:
> Enables the kernel to negotiate VIRTIO_BALLOON_F_HINTING feature with the
> host. If it is available and page_hinting_flag is set to true, page_hinting
> is enabled and its callbacks are configured along with the max_pages count
> which indicates the maximum number of pages that can be isolated and hinted
> at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are
> reported. To prevent any false OOM, the max_pages count is set to 16.
> 
> By default page_hinting feature is enabled and gets loaded as soon
> as the virtio-balloon driver is loaded. However, it could be disabled
> by writing the page_hinting_flag which is a virtio-balloon parameter.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/virtio/Kconfig              |  1 +
>  drivers/virtio/virtio_balloon.c     | 91 ++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_balloon.h | 11 ++++
>  3 files changed, 102 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 023fc3bc01c6..dcc0cb4269a5 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
>  	select MEMORY_BALLOON
> +	select PAGE_HINTING
>  	---help---
>  	 This driver supports increasing and decreasing the amount
>  	 of memory within a KVM guest.
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 44339fc87cc7..1fb0eb0b2c20 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -18,6 +18,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/page_hinting.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -35,6 +36,12 @@
>  /* The size of a free page block in bytes */
>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>  	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> +/* Number of isolated pages to be reported to the host at a time.
> + * TODO:
> + * 1. Set it via host.
> + * 2. Find an optimal value for this.
> + */
> +#define PAGE_HINTING_MAX_PAGES	16
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
>  static struct vfsmount *balloon_mnt;
> @@ -45,6 +52,7 @@ enum virtio_balloon_vq {
>  	VIRTIO_BALLOON_VQ_DEFLATE,
>  	VIRTIO_BALLOON_VQ_STATS,
>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
> +	VIRTIO_BALLOON_VQ_HINTING,
>  	VIRTIO_BALLOON_VQ_MAX
>  };
>  
> @@ -54,7 +62,8 @@ enum virtio_balloon_config_read {
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> +			 *hinting_vq;
>  
>  	/* Balloon's own wq for cpu-intensive work items */
>  	struct workqueue_struct *balloon_wq;
> @@ -112,6 +121,9 @@ struct virtio_balloon {
>  
>  	/* To register a shrinker to shrink memory upon memory pressure */
>  	struct shrinker shrinker;
> +
> +	/* Array object pointing at the isolated pages ready for hinting */
> +	struct isolated_memory isolated_pages[PAGE_HINTING_MAX_PAGES];
>  };
>  
>  static struct virtio_device_id id_table[] = {
> @@ -119,6 +131,66 @@ static struct virtio_device_id id_table[] = {
>  	{ 0 },
>  };
>  
> +static struct page_hinting_config page_hinting_conf;
> +bool page_hinting_flag = true;
> +struct virtio_balloon *hvb;
> +module_param(page_hinting_flag, bool, 0444);
> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
> +
> +static int page_hinting_report(void)
> +{
> +	struct virtqueue *vq = hvb->hinting_vq;
> +	struct scatterlist sg;
> +	int err = 0, unused;
> +
> +	mutex_lock(&hvb->balloon_lock);
> +	sg_init_one(&sg, hvb->isolated_pages, sizeof(hvb->isolated_pages[0]) *
> +		    PAGE_HINTING_MAX_PAGES);
> +	err = virtqueue_add_outbuf(vq, &sg, 1, hvb, GFP_KERNEL);

In Alex's patch, I really like it that he's passing pages as sg
entries. IMHO that's both cleaner and allows seamless
support for arbitrary page sizes.

In particular ....

> +	if (!err)
> +		virtqueue_kick(hvb->hinting_vq);
> +	wait_event(hvb->acked, virtqueue_get_buf(vq, &unused));
> +	mutex_unlock(&hvb->balloon_lock);
> +	return err;
> +}
> +
> +void hint_pages(struct list_head *pages)
> +{
> +	struct device *dev = &hvb->vdev->dev;
> +	struct page *page, *next;
> +	int idx = 0, order, err;
> +	unsigned long pfn;
> +
> +	list_for_each_entry_safe(page, next, pages, lru) {
> +		pfn = page_to_pfn(page);
> +		order = page_private(page);
> +		hvb->isolated_pages[idx].phys_addr = pfn << PAGE_SHIFT;
> +		hvb->isolated_pages[idx].size = (1 << order) * PAGE_SIZE;
> +		idx++;

... passing native endianness values to the host creates pain for
cross-endian configurations.
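
For comparison, a rough sketch (not code from either series) of queueing
each isolated page as its own sg entry - the length then travels in the
descriptor itself and no guest-endian struct is needed:

        struct scatterlist sgl[PAGE_HINTING_MAX_PAGES];
        struct page *page;
        unsigned int nents = 0;

        sg_init_table(sgl, PAGE_HINTING_MAX_PAGES);
        list_for_each_entry(page, pages, lru)
                sg_set_page(&sgl[nents++], page,
                            PAGE_SIZE << page_private(page), 0);
        /* the batch may be shorter than the array, so terminate it */
        sg_mark_end(&sgl[nents - 1]);
        virtqueue_add_outbuf(vq, sgl, nents, vb, GFP_KERNEL);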

> +	}
> +	err = page_hinting_report();
> +	if (err < 0)
> +		dev_err(dev, "Failed to hint pages, err = %d\n", err);
> +}
> +
> +static void page_hinting_init(struct virtio_balloon *vb)
> +{
> +	struct device *dev = &vb->vdev->dev;
> +	int err;
> +
> +	page_hinting_conf.hint_pages = hint_pages;
> +	page_hinting_conf.max_pages = PAGE_HINTING_MAX_PAGES;
> +	err = page_hinting_enable(&page_hinting_conf);
> +	if (err < 0) {
> +		dev_err(dev, "Failed to enable page-hinting, err = %d\n", err);

It would be nicer to disable the feature bit then, or fail probe
completely.

> +		page_hinting_flag = false;
> +		page_hinting_conf.hint_pages = NULL;
> +		page_hinting_conf.max_pages = 0;
> +		return;
> +	}
> +	hvb = vb;
> +}
> +
>  static u32 page_to_balloon_pfn(struct page *page)
>  {
>  	unsigned long pfn = page_to_pfn(page);
> @@ -475,6 +547,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> +	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>  
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -486,11 +559,18 @@ static int init_vqs(struct virtio_balloon *vb)
>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>  	}
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> +		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> +		callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
> +	}
>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>  					 vqs, callbacks, names, NULL, NULL);
>  	if (err)
>  		return err;
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> +		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> +
>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> @@ -929,6 +1009,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		if (err)
>  			goto out_del_balloon_wq;
>  	}
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
> +	    page_hinting_flag)
> +		page_hinting_init(vb);
>  	virtio_device_ready(vdev);
>  
>  	if (towards_target(vb))
> @@ -976,6 +1059,10 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  		destroy_workqueue(vb->balloon_wq);
>  	}
>  
> +	if (!page_hinting_flag) {
> +		hvb = NULL;
> +		page_hinting_disable();
> +	}
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	if (vb->vb_dev_info.inode)
> @@ -1030,8 +1117,10 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_HINTING,
>  	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>  	VIRTIO_BALLOON_F_PAGE_POISON,
> +	VIRTIO_BALLOON_F_HINTING,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index a1966cd7b677..29eed0ec83d3 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -36,6 +36,8 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
> +/* TODO: Find a better name to avoid any confusion with FREE_PAGE_HINT */
> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -108,4 +110,13 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +/*
> + * struct isolated_memory - holds the pages which will be reported to the host.
> + * @phys_addr:	physical address associated with a page.
> + * @size:	total size of memory to be reported.
> + */
> +struct isolated_memory {
> +	__virtio64 phys_addr;
> +	__virtio64 size;
> +};
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.21.0


* Re: [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-07-24 19:47   ` Michael S. Tsirkin
@ 2019-07-24 19:56     ` David Hildenbrand
  2019-07-24 20:10       ` Nitesh Narayan Lal
  2019-07-24 20:06     ` Nitesh Narayan Lal
  1 sibling, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2019-07-24 19:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb,
	aarcange, alexander.duyck, john.starks, dave.hansen, mhocko

On 24.07.19 21:47, Michael S. Tsirkin wrote:
> On Wed, Jul 10, 2019 at 03:51:58PM -0400, Nitesh Narayan Lal wrote:
>> Enables the kernel to negotiate VIRTIO_BALLOON_F_HINTING feature with the
>> host. If it is available and page_hinting_flag is set to true, page_hinting
>> is enabled and its callbacks are configured along with the max_pages count
>> which indicates the maximum number of pages that can be isolated and hinted
>> at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are
>> reported. To prevent any false OOM, the max_pages count is set to 16.
>>
>> By default page_hinting feature is enabled and gets loaded as soon
>> as the virtio-balloon driver is loaded. However, it could be disabled
>> by writing the page_hinting_flag which is a virtio-balloon parameter.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/Kconfig              |  1 +
>>  drivers/virtio/virtio_balloon.c     | 91 ++++++++++++++++++++++++++++-
>>  include/uapi/linux/virtio_balloon.h | 11 ++++
>>  3 files changed, 102 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 023fc3bc01c6..dcc0cb4269a5 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
>>  	tristate "Virtio balloon driver"
>>  	depends on VIRTIO
>>  	select MEMORY_BALLOON
>> +	select PAGE_HINTING
>>  	---help---
>>  	 This driver supports increasing and decreasing the amount
>>  	 of memory within a KVM guest.
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 44339fc87cc7..1fb0eb0b2c20 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -18,6 +18,7 @@
>>  #include <linux/mm.h>
>>  #include <linux/mount.h>
>>  #include <linux/magic.h>
>> +#include <linux/page_hinting.h>
>>  
>>  /*
>>   * Balloon device works in 4K page units.  So each page is pointed to by
>> @@ -35,6 +36,12 @@
>>  /* The size of a free page block in bytes */
>>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>>  	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>> +/* Number of isolated pages to be reported to the host at a time.
>> + * TODO:
>> + * 1. Set it via host.
>> + * 2. Find an optimal value for this.
>> + */
>> +#define PAGE_HINTING_MAX_PAGES	16
>>  
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>  static struct vfsmount *balloon_mnt;
>> @@ -45,6 +52,7 @@ enum virtio_balloon_vq {
>>  	VIRTIO_BALLOON_VQ_DEFLATE,
>>  	VIRTIO_BALLOON_VQ_STATS,
>>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
>> +	VIRTIO_BALLOON_VQ_HINTING,
>>  	VIRTIO_BALLOON_VQ_MAX
>>  };
>>  
>> @@ -54,7 +62,8 @@ enum virtio_balloon_config_read {
>>  
>>  struct virtio_balloon {
>>  	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
>> +			 *hinting_vq;
>>  
>>  	/* Balloon's own wq for cpu-intensive work items */
>>  	struct workqueue_struct *balloon_wq;
>> @@ -112,6 +121,9 @@ struct virtio_balloon {
>>  
>>  	/* To register a shrinker to shrink memory upon memory pressure */
>>  	struct shrinker shrinker;
>> +
>> +	/* Array object pointing at the isolated pages ready for hinting */
>> +	struct isolated_memory isolated_pages[PAGE_HINTING_MAX_PAGES];
>>  };
>>  
>>  static struct virtio_device_id id_table[] = {
>> @@ -119,6 +131,66 @@ static struct virtio_device_id id_table[] = {
>>  	{ 0 },
>>  };
>>  
>> +static struct page_hinting_config page_hinting_conf;
>> +bool page_hinting_flag = true;
>> +struct virtio_balloon *hvb;
>> +module_param(page_hinting_flag, bool, 0444);
>> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
>> +
>> +static int page_hinting_report(void)
>> +{
>> +	struct virtqueue *vq = hvb->hinting_vq;
>> +	struct scatterlist sg;
>> +	int err = 0, unused;
>> +
>> +	mutex_lock(&hvb->balloon_lock);
>> +	sg_init_one(&sg, hvb->isolated_pages, sizeof(hvb->isolated_pages[0]) *
>> +		    PAGE_HINTING_MAX_PAGES);
>> +	err = virtqueue_add_outbuf(vq, &sg, 1, hvb, GFP_KERNEL);
> 
> In Alex's patch, I really like it that he's passing pages as sg
> entries. IMHO that's both cleaner and allows seamless
> support for arbitrary page sizes.
> 

+1

I especially like passing full addresses and sizes instead of PFNs and
orders (compared to Alex's v1, where he would pass PFNs and orders).

-- 

Thanks,

David / dhildenb


* Re: [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-07-24 19:47   ` Michael S. Tsirkin
  2019-07-24 19:56     ` David Hildenbrand
@ 2019-07-24 20:06     ` Nitesh Narayan Lal
  1 sibling, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-24 20:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck, john.starks, dave.hansen,
	mhocko


On 7/24/19 3:47 PM, Michael S. Tsirkin wrote:
> On Wed, Jul 10, 2019 at 03:51:58PM -0400, Nitesh Narayan Lal wrote:
>> Enables the kernel to negotiate VIRTIO_BALLOON_F_HINTING feature with the
>> host. If it is available and page_hinting_flag is set to true, page_hinting
>> is enabled and its callbacks are configured along with the max_pages count
>> which indicates the maximum number of pages that can be isolated and hinted
>> at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are
>> reported. To prevent any false OOM max_pages count is set to 16.
>>
>> By default page_hinting feature is enabled and gets loaded as soon
>> as the virtio-balloon driver is loaded. However, it could be disabled
>> by writing the page_hinting_flag which is a virtio-balloon parameter.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/virtio/Kconfig              |  1 +
>>  drivers/virtio/virtio_balloon.c     | 91 ++++++++++++++++++++++++++++-
>>  include/uapi/linux/virtio_balloon.h | 11 ++++
>>  3 files changed, 102 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 023fc3bc01c6..dcc0cb4269a5 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
>>  	tristate "Virtio balloon driver"
>>  	depends on VIRTIO
>>  	select MEMORY_BALLOON
>> +	select PAGE_HINTING
>>  	---help---
>>  	 This driver supports increasing and decreasing the amount
>>  	 of memory within a KVM guest.
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 44339fc87cc7..1fb0eb0b2c20 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -18,6 +18,7 @@
>>  #include <linux/mm.h>
>>  #include <linux/mount.h>
>>  #include <linux/magic.h>
>> +#include <linux/page_hinting.h>
>>  
>>  /*
>>   * Balloon device works in 4K page units.  So each page is pointed to by
>> @@ -35,6 +36,12 @@
>>  /* The size of a free page block in bytes */
>>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>>  	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>> +/* Number of isolated pages to be reported to the host at a time.
>> + * TODO:
>> + * 1. Set it via host.
>> + * 2. Find an optimal value for this.
>> + */
>> +#define PAGE_HINTING_MAX_PAGES	16
>>  
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>  static struct vfsmount *balloon_mnt;
>> @@ -45,6 +52,7 @@ enum virtio_balloon_vq {
>>  	VIRTIO_BALLOON_VQ_DEFLATE,
>>  	VIRTIO_BALLOON_VQ_STATS,
>>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
>> +	VIRTIO_BALLOON_VQ_HINTING,
>>  	VIRTIO_BALLOON_VQ_MAX
>>  };
>>  
>> @@ -54,7 +62,8 @@ enum virtio_balloon_config_read {
>>  
>>  struct virtio_balloon {
>>  	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
>> +			 *hinting_vq;
>>  
>>  	/* Balloon's own wq for cpu-intensive work items */
>>  	struct workqueue_struct *balloon_wq;
>> @@ -112,6 +121,9 @@ struct virtio_balloon {
>>  
>>  	/* To register a shrinker to shrink memory upon memory pressure */
>>  	struct shrinker shrinker;
>> +
>> +	/* Array object pointing at the isolated pages ready for hinting */
>> +	struct isolated_memory isolated_pages[PAGE_HINTING_MAX_PAGES];
>>  };
>>  
>>  static struct virtio_device_id id_table[] = {
>> @@ -119,6 +131,66 @@ static struct virtio_device_id id_table[] = {
>>  	{ 0 },
>>  };
>>  
>> +static struct page_hinting_config page_hinting_conf;
>> +bool page_hinting_flag = true;
>> +struct virtio_balloon *hvb;
>> +module_param(page_hinting_flag, bool, 0444);
>> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
>> +
>> +static int page_hinting_report(void)
>> +{
>> +	struct virtqueue *vq = hvb->hinting_vq;
>> +	struct scatterlist sg;
>> +	int err = 0, unused;
>> +
>> +	mutex_lock(&hvb->balloon_lock);
>> +	sg_init_one(&sg, hvb->isolated_pages, sizeof(hvb->isolated_pages[0]) *
>> +		    PAGE_HINTING_MAX_PAGES);
>> +	err = virtqueue_add_outbuf(vq, &sg, 1, hvb, GFP_KERNEL);
> In Alex's patch, I really like it that he's passing pages as sg
> entries. IMHO that's both cleaner and allows seamless
> support for arbitrary page sizes.
>
> In particular ....
+1. I will also incorporate this change.
>
>> +	if (!err)
>> +		virtqueue_kick(hvb->hinting_vq);
>> +	wait_event(hvb->acked, virtqueue_get_buf(vq, &unused));
>> +	mutex_unlock(&hvb->balloon_lock);
>> +	return err;
>> +}
>> +
>> +void hint_pages(struct list_head *pages)
>> +{
>> +	struct device *dev = &hvb->vdev->dev;
>> +	struct page *page, *next;
>> +	int idx = 0, order, err;
>> +	unsigned long pfn;
>> +
>> +	list_for_each_entry_safe(page, next, pages, lru) {
>> +		pfn = page_to_pfn(page);
>> +		order = page_private(page);
>> +		hvb->isolated_pages[idx].phys_addr = pfn << PAGE_SHIFT;
>> +		hvb->isolated_pages[idx].size = (1 << order) * PAGE_SIZE;
>> +		idx++;
> ... passing native endianness values to the host creates pain for
> cross-endian configurations.
>
>> +	}
>> +	err = page_hinting_report();
>> +	if (err < 0)
>> +		dev_err(dev, "Failed to hint pages, err = %d\n", err);
>> +}
>> +
>> +static void page_hinting_init(struct virtio_balloon *vb)
>> +{
>> +	struct device *dev = &vb->vdev->dev;
>> +	int err;
>> +
>> +	page_hinting_conf.hint_pages = hint_pages;
>> +	page_hinting_conf.max_pages = PAGE_HINTING_MAX_PAGES;
>> +	err = page_hinting_enable(&page_hinting_conf);
>> +	if (err < 0) {
>> +		dev_err(dev, "Failed to enable page-hinting, err = %d\n", err);
> It would be nicer to disable the feature bit then, or fail probe
> completely.
Makes sense. Thanks.
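E.g., roughly (untested; assumes page_hinting_init() is changed to return
the error from page_hinting_enable() instead of swallowing it):

        /* in virtballoon_probe() */
        if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
            page_hinting_flag) {
                err = page_hinting_init(vb);
                if (err)
                        goto out_del_balloon_wq;
        }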
>> +		page_hinting_flag = false;
>> +		page_hinting_conf.hint_pages = NULL;
>> +		page_hinting_conf.max_pages = 0;
>> +		return;
>> +	}
>> +	hvb = vb;
>> +}
>> +
>>  static u32 page_to_balloon_pfn(struct page *page)
>>  {
>>  	unsigned long pfn = page_to_pfn(page);
>> @@ -475,6 +547,7 @@ static int init_vqs(struct virtio_balloon *vb)
>>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>> +	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>>  
>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
>> @@ -486,11 +559,18 @@ static int init_vqs(struct virtio_balloon *vb)
>>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>>  	}
>>  
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
>> +		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
>> +		callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
>> +	}
>>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>>  					 vqs, callbacks, names, NULL, NULL);
>>  	if (err)
>>  		return err;
>>  
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
>> +		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
>> +
>>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>> @@ -929,6 +1009,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>  		if (err)
>>  			goto out_del_balloon_wq;
>>  	}
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING) &&
>> +	    page_hinting_flag)
>> +		page_hinting_init(vb);
>>  	virtio_device_ready(vdev);
>>  
>>  	if (towards_target(vb))
>> @@ -976,6 +1059,10 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>  		destroy_workqueue(vb->balloon_wq);
>>  	}
>>  
>> +	if (!page_hinting_flag) {
>> +		hvb = NULL;
>> +		page_hinting_disable();
>> +	}
>>  	remove_common(vb);
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>  	if (vb->vb_dev_info.inode)
>> @@ -1030,8 +1117,10 @@ static unsigned int features[] = {
>>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>>  	VIRTIO_BALLOON_F_STATS_VQ,
>>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>> +	VIRTIO_BALLOON_F_HINTING,
>>  	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>>  	VIRTIO_BALLOON_F_PAGE_POISON,
>> +	VIRTIO_BALLOON_F_HINTING,
>>  };
>>  
>>  static struct virtio_driver virtio_balloon_driver = {
>> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
>> index a1966cd7b677..29eed0ec83d3 100644
>> --- a/include/uapi/linux/virtio_balloon.h
>> +++ b/include/uapi/linux/virtio_balloon.h
>> @@ -36,6 +36,8 @@
>>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
>> +/* TODO: Find a better name to avoid any confusion with FREE_PAGE_HINT */
>> +#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
>>  
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>> @@ -108,4 +110,13 @@ struct virtio_balloon_stat {
>>  	__virtio64 val;
>>  } __attribute__((packed));
>>  
>> +/*
>> + * struct isolated_memory - holds the pages which will be reported to the host.
>> + * @phys_addr:	physical address associated with a page.
>> + * @size:	total size of memory to be reported.
>> + */
>> +struct isolated_memory {
>> +	__virtio64 phys_addr;
>> +	__virtio64 size;
>> +};
>>  #endif /* _LINUX_VIRTIO_BALLOON_H */
>> -- 
>> 2.21.0
-- 
Thanks
Nitesh


* Re: [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host
  2019-07-24 19:56     ` David Hildenbrand
@ 2019-07-24 20:10       ` Nitesh Narayan Lal
  0 siblings, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-07-24 20:10 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb,
	aarcange, alexander.duyck, john.starks, dave.hansen, mhocko


On 7/24/19 3:56 PM, David Hildenbrand wrote:
> On 24.07.19 21:47, Michael S. Tsirkin wrote:
>> On Wed, Jul 10, 2019 at 03:51:58PM -0400, Nitesh Narayan Lal wrote:
>>> Enables the kernel to negotiate VIRTIO_BALLOON_F_HINTING feature with the
>>> host. If it is available and page_hinting_flag is set to true, page_hinting
>>> is enabled and its callbacks are configured along with the max_pages count
>>> which indicates the maximum number of pages that can be isolated and hinted
>>> at a time. Currently, only free pages of order >= (MAX_ORDER - 2) are
>>> reported. To prevent any false OOM, the max_pages count is set to 16.
>>>
>>> By default page_hinting feature is enabled and gets loaded as soon
>>> as the virtio-balloon driver is loaded. However, it could be disabled
>>> by writing the page_hinting_flag which is a virtio-balloon parameter.
>>>
>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> ---
>>>  drivers/virtio/Kconfig              |  1 +
>>>  drivers/virtio/virtio_balloon.c     | 91 ++++++++++++++++++++++++++++-
>>>  include/uapi/linux/virtio_balloon.h | 11 ++++
>>>  3 files changed, 102 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>>> index 023fc3bc01c6..dcc0cb4269a5 100644
>>> --- a/drivers/virtio/Kconfig
>>> +++ b/drivers/virtio/Kconfig
>>> @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
>>>  	tristate "Virtio balloon driver"
>>>  	depends on VIRTIO
>>>  	select MEMORY_BALLOON
>>> +	select PAGE_HINTING
>>>  	---help---
>>>  	 This driver supports increasing and decreasing the amount
>>>  	 of memory within a KVM guest.
>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>> index 44339fc87cc7..1fb0eb0b2c20 100644
>>> --- a/drivers/virtio/virtio_balloon.c
>>> +++ b/drivers/virtio/virtio_balloon.c
>>> @@ -18,6 +18,7 @@
>>>  #include <linux/mm.h>
>>>  #include <linux/mount.h>
>>>  #include <linux/magic.h>
>>> +#include <linux/page_hinting.h>
>>>  
>>>  /*
>>>   * Balloon device works in 4K page units.  So each page is pointed to by
>>> @@ -35,6 +36,12 @@
>>>  /* The size of a free page block in bytes */
>>>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>>>  	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>>> +/* Number of isolated pages to be reported to the host at a time.
>>> + * TODO:
>>> + * 1. Set it via host.
>>> + * 2. Find an optimal value for this.
>>> + */
>>> +#define PAGE_HINTING_MAX_PAGES	16
>>>  
>>>  #ifdef CONFIG_BALLOON_COMPACTION
>>>  static struct vfsmount *balloon_mnt;
>>> @@ -45,6 +52,7 @@ enum virtio_balloon_vq {
>>>  	VIRTIO_BALLOON_VQ_DEFLATE,
>>>  	VIRTIO_BALLOON_VQ_STATS,
>>>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
>>> +	VIRTIO_BALLOON_VQ_HINTING,
>>>  	VIRTIO_BALLOON_VQ_MAX
>>>  };
>>>  
>>> @@ -54,7 +62,8 @@ enum virtio_balloon_config_read {
>>>  
>>>  struct virtio_balloon {
>>>  	struct virtio_device *vdev;
>>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
>>> +			 *hinting_vq;
>>>  
>>>  	/* Balloon's own wq for cpu-intensive work items */
>>>  	struct workqueue_struct *balloon_wq;
>>> @@ -112,6 +121,9 @@ struct virtio_balloon {
>>>  
>>>  	/* To register a shrinker to shrink memory upon memory pressure */
>>>  	struct shrinker shrinker;
>>> +
>>> +	/* Array object pointing at the isolated pages ready for hinting */
>>> +	struct isolated_memory isolated_pages[PAGE_HINTING_MAX_PAGES];
>>>  };
>>>  
>>>  static struct virtio_device_id id_table[] = {
>>> @@ -119,6 +131,66 @@ static struct virtio_device_id id_table[] = {
>>>  	{ 0 },
>>>  };
>>>  
>>> +static struct page_hinting_config page_hinting_conf;
>>> +bool page_hinting_flag = true;
>>> +struct virtio_balloon *hvb;
>>> +module_param(page_hinting_flag, bool, 0444);
>>> +MODULE_PARM_DESC(page_hinting_flag, "Enable page hinting");
>>> +
>>> +static int page_hinting_report(void)
>>> +{
>>> +	struct virtqueue *vq = hvb->hinting_vq;
>>> +	struct scatterlist sg;
>>> +	int err = 0, unused;
>>> +
>>> +	mutex_lock(&hvb->balloon_lock);
>>> +	sg_init_one(&sg, hvb->isolated_pages, sizeof(hvb->isolated_pages[0]) *
>>> +		    PAGE_HINTING_MAX_PAGES);
>>> +	err = virtqueue_add_outbuf(vq, &sg, 1, hvb, GFP_KERNEL);
>> In Alex's patch, I really like that he's passing pages as sg
>> entries. IMHO that's both cleaner and allows seamless
>> support for arbitrary page sizes.
>>
> +1
>
> I especially like passing full addresses and sizes instead of PFNs and
> orders (compared to Alex's v1, where he would pass PFNs and orders).
I agree, it avoids the issues that could otherwise arise from different
page sizes in the host and the guest.
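
A minimal sketch of that alternative, for comparison with the
sg_init_one() call above: one scatterlist entry per isolated page,
each carrying a page pointer and a byte length rather than a PFN and
an order. The helper name is hypothetical, and the batch size is
assumed to stay at PAGE_HINTING_MAX_PAGES.

static int report_isolated_pages(struct virtqueue *vq,
				 struct page **pages, unsigned int count)
{
	struct scatterlist sg[PAGE_HINTING_MAX_PAGES];
	unsigned int i;

	sg_init_table(sg, count);
	for (i = 0; i < count; i++)
		/*
		 * Address + length per entry, so the host can apply its
		 * own page size when processing the hint.
		 */
		sg_set_page(&sg[i], pages[i],
			    PAGE_SIZE << (MAX_ORDER - 2), 0);

	/* All entries go out as a single buffer chain. */
	return virtqueue_add_outbuf(vq, sg, count, pages, GFP_KERNEL);
}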
>
-- 
Thanks
Nitesh

* Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure
  2019-07-12 16:22           ` Alexander Duyck
  2019-07-12 16:25             ` Nitesh Narayan Lal
@ 2019-08-08 11:41             ` Nitesh Narayan Lal
  1 sibling, 0 replies; 43+ messages in thread
From: Nitesh Narayan Lal @ 2019-08-08 11:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta,
	wei.w.wang, Yang Zhang, Rik van Riel, David Hildenbrand,
	Michael S. Tsirkin, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli, john.starks, Dave Hansen, Michal Hocko


On 7/12/19 12:22 PM, Alexander Duyck wrote:
> On Thu, Jul 11, 2019 at 6:13 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 7/11/19 7:20 PM, Alexander Duyck wrote:
>>> On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> On 7/10/19 5:56 PM, Alexander Duyck wrote:
>>>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>>>> This patch introduces the core infrastructure for free page hinting in
>>>>>> virtual environments. It enables the kernel to track the free pages which
>>>>>> can be reported to its hypervisor so that the hypervisor can free
>>>>>> and reuse that memory as needed.
>>>>>>
>>>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>>>> would be possible. To avoid such a situation, these pages are
>>>>>> temporarily removed from the buddy. The number of pages removed
>>>>>> temporarily from the buddy is governed by the backend (virtio-balloon
>>>>>> in our case).
>>>>>>
>>>>>> To efficiently identify free pages that can be hinted to the
>>>>>> hypervisor, coarse-granularity bitmaps are used. Only fairly big
>>>>>> chunks - "MAX_ORDER - 2" on x86 - are reported to the hypervisor,
>>>>>> both to avoid breaking up THP in the hypervisor and to save space.
>>>>>> A set bit indicates that a page *might* be free; it is not a
>>>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>>>
>>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>>>> freeing is complete, isolated pages are returned to the buddy.
>>>>>>
>>>>>> There are still various things to look into (e.g., memory hotplug, more
>>>>>> efficient locking, possible races when disabling).
>>>>>>
>>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
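
A minimal sketch of the per-zone bitmap hook described in the commit
message above. The struct and helper names here are hypothetical;
only the idea - one bit per "MAX_ORDER - 2" chunk, set under the zone
lock from the buddy merging path - is taken from the patch.

struct zone_hinting {
	unsigned long *bitmap;	/* one bit per MAX_ORDER - 2 sized chunk */
	unsigned long base_pfn;	/* zone_start_pfn when the bitmap was sized */
	atomic_t free_chunks;	/* drives the reporting threshold */
};

static void bm_set_pfn(struct zone_hinting *zh, struct page *page)
{
	unsigned long bitnr;

	lockdep_assert_held(&page_zone(page)->lock);
	bitnr = (page_to_pfn(page) - zh->base_pfn) >> (MAX_ORDER - 2);
	/* A set bit means the chunk *might* be free, not a guarantee. */
	if (!test_and_set_bit(bitnr, zh->bitmap))
		atomic_inc(&zh->free_chunks);
}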
>>> So just FYI, I thought I would try the patches. It looks like there
>>> might be a bug somewhere causing it to free memory it shouldn't.
>>> After about 10 minutes my VM crashed with a system log
>>> full of various NULL pointer dereferences.
>> That's interesting. I have tried the patches with MADV_DONTNEED as well.
>> I just retried it but didn't see any crash. May I know what kind of
>> workload you are running?
> I was running the page_fault1 test on a VM with 80G of memory.
>
>>>  The only change I had made
>>> is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers
>>> didn't have MADV_FREE on the host. It occurs to me one advantage of
>>> MADV_DONTNEED over MADV_FREE is that you are more likely to catch
>>> this sort of error since it zeros the pages instead of leaving them
>>> intact.
>> For development purposes, maybe. For the final patch-set, I think we
>> discussed earlier why we should keep MADV_FREE.
> I'm still not convinced MADV_FREE is a net win, at least for
> performance. You are still paying the cost for the VMEXIT in order to
> regain ownership of the page. In the case that you are under memory
> pressure it is essentially equivalent to MADV_DONTNEED. Also it
> doesn't really do much to help with the memory footprint of the VM
> itself. With MADV_DONTNEED the pages are freed back and you have a
> greater likelihood of reducing the overall memory footprint of the
> entire system since you would be more likely to be assigned pages that
> were recently used rather than having to access a cold page.

I was able to reproduce this bug and have fixed it.
I tested the fix by running will-it-scale/page_fault1 for around 12 hours.
For now, I have also moved to MADV_DONTNEED.
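
For reference, the host-side distinction being debated, as a sketch
rather than the actual QEMU change; the helper and its arguments are
hypothetical:

#include <stdbool.h>
#include <sys/mman.h>

/*
 * Discard a guest memory range after the guest hints that it is free.
 * MADV_FREE keeps the contents until memory pressure, so stale guest
 * accesses may still see old data and bugs stay hidden; MADV_DONTNEED
 * discards immediately, so a stale access reads zeroes and fails
 * loudly - which is how the crash above was caught.
 */
static int discard_guest_range(void *hva, size_t len, bool lazy)
{
	return madvise(hva, len, lazy ? MADV_FREE : MADV_DONTNEED);
}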


> <snip>
>
>>>>>> +void page_hinting_enqueue(struct page *page, int order)
>>>>>> +{
>>>>>> +       int zone_idx;
>>>>>> +
>>>>>> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
>>>>>> +               return;
>>>>> I would think it is going to be expensive to be jumping into this
>>>>> function for every freed page. You should probably have an inline
>>>>> taking care of the order check before you even get here since it would
>>>>> be faster that way.
>>>> I see, I can take a look. Thanks.
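
A minimal sketch of that suggestion: keep only the cheap order check
in an inline helper so the common case never pays for a function
call. __page_hinting_enqueue() is a hypothetical out-of-line helper
holding the rest of the logic shown below.

static inline void page_hinting_enqueue(struct page *page, int order)
{
	/* Filter the overwhelmingly common small-order frees inline. */
	if (order < PAGE_HINTING_MIN_ORDER)
		return;
	__page_hinting_enqueue(page, order);
}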
>>>>>> +
>>>>>> +       bm_set_pfn(page);
>>>>>> +       if (atomic_read(&page_hinting_active))
>>>>>> +               return;
>>>>> So I would think this piece is racy. Specifically if you set a PFN
>>>>> that is somewhere below the PFN you are currently processing in your
>>>>> scan it is going to remain unset until you have another page freed
>>>>> after the scan is completed. I would worry you can end up with a batch
>>>>> free of memory resulting in a group of pages sitting at the start of
>>>>> your bitmap unhinted.
>>>> True, but that will be hinted the next time the threshold is met.
>>> Yes, but that assumes another free is coming immediately.
>>> It is possible that you have a big application run and then
>>> immediately shut down and have it free all its memory at once. Worst
>>> case scenario would be that it starts by freeing from the end and
>>> works toward the start. With that you could theoretically end up with
>>> a significant chunk of memory waiting some time for another big free
>>> to come along.
>> Any suggestions for a benchmark/test application I could run to
>> see this kind of behavior?
> Like I mentioned before, try doing a VM with a bigger memory
> footprint. You could probably just do a stack of VMs like what we were
> doing with the memhog test. Basically, the longer it takes to process
> all the pages, the greater the likelihood that there are still pages
> left when they are freed.
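
One possible way to narrow that window, as a sketch; all names are
hypothetical and this is not part of the posted series:

static atomic_t bits_set_during_scan;	/* bumped by the enqueue path */

static void hinting_work_fn(struct work_struct *work)
{
	/*
	 * If any bits were set behind the scan cursor while a scan was
	 * in flight, rescan before going idle instead of waiting for
	 * the threshold to be crossed again.
	 */
	do {
		atomic_set(&bits_set_during_scan, 0);
		scan_zones_and_report();	/* walk bitmaps, isolate, hint */
	} while (atomic_read(&bits_set_during_scan));
}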
-- 
Thanks
Nitesh

end of thread

Thread overview: 43+ messages
2019-07-10 19:51 [RFC][PATCH v11 0/2] mm: Support for page hinting Nitesh Narayan Lal
2019-07-10 19:51 ` [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure Nitesh Narayan Lal
2019-07-10 20:45   ` Dave Hansen
2019-07-11 11:48     ` Nitesh Narayan Lal
2019-07-11 15:25     ` Nitesh Narayan Lal
2019-07-11 15:50       ` Nitesh Narayan Lal
2019-07-11 16:22       ` Dave Hansen
2019-07-11 16:36         ` Nitesh Narayan Lal
2019-07-11 16:45           ` Dave Hansen
2019-07-11 16:52             ` Nitesh Narayan Lal
2019-07-15  9:26     ` David Hildenbrand
2019-07-10 21:56   ` Alexander Duyck
2019-07-11 17:58     ` Nitesh Narayan Lal
2019-07-11 23:20       ` Alexander Duyck
2019-07-12  1:12         ` Nitesh Narayan Lal
2019-07-12 16:22           ` Alexander Duyck
2019-07-12 16:25             ` Nitesh Narayan Lal
2019-08-08 11:41             ` Nitesh Narayan Lal
2019-07-11 18:21   ` Dave Hansen
2019-07-15  9:33     ` David Hildenbrand
2019-07-15 14:40       ` David Hildenbrand
2019-07-10 19:51 ` [RFC][Patch v11 2/2] virtio-balloon: page_hinting: reporting to the host Nitesh Narayan Lal
2019-07-24 19:47   ` Michael S. Tsirkin
2019-07-24 19:56     ` David Hildenbrand
2019-07-24 20:10       ` Nitesh Narayan Lal
2019-07-24 20:06     ` Nitesh Narayan Lal
2019-07-10 19:53 ` [QEMU Patch] virtio-baloon: Support for page hinting Nitesh Narayan Lal
2019-07-10 20:17   ` Alexander Duyck
2019-07-11 12:03     ` Nitesh Narayan Lal
2019-07-11  8:49   ` Cornelia Huck
2019-07-11 11:13     ` Nitesh Narayan Lal
2019-07-11 18:55   ` Michael S. Tsirkin
2019-07-11 19:06     ` Nitesh Narayan Lal
2019-07-11 22:36       ` Alexander Duyck
2019-07-10 20:19 ` [RFC][PATCH v11 0/2] mm: " Dave Hansen
2019-07-11 11:37   ` Nitesh Narayan Lal
2019-07-10 23:40 ` Alexander Duyck
2019-07-11 11:30   ` Nitesh Narayan Lal
2019-07-11 14:58     ` Alexander Duyck
2019-07-11 15:03       ` Nitesh Narayan Lal
2019-07-11 15:08         ` Alexander Duyck
2019-07-11 15:19           ` Nitesh Narayan Lal
2019-07-11 17:01             ` Alexander Duyck
