linux-kernel.vger.kernel.org archive mirror
* [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
@ 2020-01-22 17:43 Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 1/9] mm: Adjust shuffle code to allow for future coalescing Alexander Duyck
                   ` (11 more replies)
  0 siblings, 12 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

This series provides an asynchronous means of reporting free guest pages
to a hypervisor so that the memory associated with those pages can be
dropped and reused by other processes and/or guests on the host. Using
this it is possible to avoid unnecessary I/O to disk and greatly improve
performance in the case of memory overcommit on the host.

When enabled we will be performing a scan of free memory every 2 seconds
while pages of sufficiently high order are being freed. In each pass at
least one sixteenth of each free list will be reported. By doing this we
avoid racing against other threads that may be causing a high amount of
memory churn.
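
As a rough illustration of the one-sixteenth floor described above, the
arithmetic might be sketched as follows. The helper names report_budget()
and passes_to_cover() are hypothetical and are not taken from the patches;
the round-up is an assumption chosen so that a short, non-empty list still
makes progress.

```c
#include <stddef.h>

/* Hypothetical helper: pages to report from one free list in one pass. */
static size_t report_budget(size_t nr_free)
{
	/* Assumed rounding: round up so a short list still gets reported. */
	return (nr_free + 15) / 16;
}

/*
 * Hypothetical helper: passes needed to cover a list whose length stays
 * at nr_free, given that each pass handles report_budget() pages.
 */
static size_t passes_to_cover(size_t nr_free)
{
	size_t budget = report_budget(nr_free);

	return budget ? (nr_free + budget - 1) / budget : 0;
}
```

Under these assumptions a list that is not churning is covered in at most
16 passes, which at one pass every 2 seconds is roughly 32 seconds.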

The lowest page order currently scanned when reporting pages is
pageblock_order so that this feature will not interfere with the use of
Transparent Huge Pages in the case of virtualization.

Currently this is only in use by virtio-balloon; however, there is the hope
that at some point in the future other hypervisors might be able to make
use of it. In the virtio-balloon/QEMU implementation the hypervisor
currently uses MADV_DONTNEED to indicate to the host kernel that the page
is free. It will be zeroed and faulted back into the guest the next time
the page is accessed.
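
The zero-on-next-access behaviour can be observed from ordinary user space
on Linux. The sketch below is not QEMU code; it is a minimal stand-alone
illustration, assuming a private anonymous mapping, and the function name
dontneed_zeroes_page() is made up for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if the page reads back as zero after MADV_DONTNEED,
 * 0 if it does not, and -1 if the setup failed.
 */
static int dontneed_zeroes_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	int ret;

	if (page_size <= 0)
		return -1;

	buf = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Dirty the page so that it is backed by real memory. */
	memset(buf, 0xaa, (size_t)page_size);

	/* Tell the kernel the contents are no longer needed. */
	if (madvise(buf, (size_t)page_size, MADV_DONTNEED)) {
		munmap(buf, (size_t)page_size);
		return -1;
	}

	/* The next access faults in a fresh zero-filled page. */
	ret = (buf[0] == 0 && buf[page_size - 1] == 0);

	munmap(buf, (size_t)page_size);
	return ret;
}
```

The extra fault and zeroing on that next access is the cost that shows up
in the benchmark numbers further down.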

To track if a page is reported or not the Uptodate flag was repurposed and
used as a Reported flag for Buddy pages. We walk through the free list
isolating pages and adding them to the scatterlist until we either
encounter the end of the list, have processed as many pages as were listed
in nr_free prior to us starting, or have filled the scatterlist with pages
to be reported. If we fill the scatterlist before we reach the end of the
list we rotate the list so that the first unreported page we encounter is
moved to the head of the list as that is where we will resume after we
have freed the reported pages back into the tail of the list.
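
The walk-and-rotate scheme above can be modelled in user space with a
small circular list. Everything in this sketch is a simplification and an
assumption: struct node stands in for a free page, BATCH for the
scatterlist capacity, and the helpers are hypothetical rather than the
functions used in the patches. The nr_free limit is left out for brevity.

```c
#include <stddef.h>

#define BATCH 2	/* stand-in for the scatterlist capacity */

struct node {
	struct node *prev, *next;
	int reported;	/* stand-in for the Reported flag */
	int id;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Move @head so that @first becomes the first entry of the list. */
static void rotate_to_front(struct node *first, struct node *head)
{
	if (head->next == first)
		return;
	list_del(head);
	head->prev = first->prev;
	head->next = first;
	first->prev->next = head;
	first->prev = head;
}

/* Isolate up to BATCH unreported nodes; returns how many were taken. */
static int collect_batch(struct node *head, struct node **batch)
{
	struct node *n, *next;
	int count = 0;

	for (n = head->next; n != head; n = next) {
		next = n->next;
		if (n->reported)
			continue;
		if (count == BATCH) {
			/* Batch is full: resume from this node next time. */
			rotate_to_front(n, head);
			break;
		}
		list_del(n);
		batch[count++] = n;
	}
	return count;
}

/* Mark the batch as reported and return it to the tail of the list. */
static void putback_batch(struct node *head, struct node **batch, int count)
{
	for (int i = 0; i < count; i++) {
		batch[i]->reported = 1;
		list_add_tail(batch[i], head);
	}
}
```

Because reported entries go back to the tail and the head is moved to the
first unreported entry, the next batch starts without rescanning entries
that were already handled.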

Below are the results from various benchmarks. I primarily focused on two
tests. The first is the will-it-scale/page_fault2 test, and the other is
a modified version of will-it-scale/page_fault1 that was enabled to use
THP. I did this as it allows for better visibility into different parts
of the memory subsystem. The guest is running with 32G of RAM on one
node of an E5-2630 v3. The host has had some features such as CPU turbo
disabled in the BIOS.

Test                   page_fault1 (THP)    page_fault2
Name            tasks  Process Iter  STDEV  Process Iter  STDEV
Baseline            1    1012402.50  0.14%     361855.25  0.81%
                   16    8827457.25  0.09%    3282347.00  0.34%

Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
                   16    8784741.75  0.39%    3240669.25  0.48%

Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
                   16    8756219.00  0.24%    3226608.75  0.97%

Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
 page shuffle      16    8672601.25  0.49%    3223177.75  0.40%

Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
 shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%

The results above are for a baseline with a linux-next-20191219 kernel,
that kernel with this patch set applied but page reporting disabled in
virtio-balloon, the patches applied and page reporting fully enabled, the
patches enabled with page shuffling enabled, and the patches applied with
page shuffling enabled and an RFC patch that makes use of MADV_FREE in
QEMU. These results include the deviation seen between the average value
reported here versus the high and/or low value. I observed that during the
test memory usage for the first three tests never dropped whereas with the
patches fully enabled the VM would drop to using only a few GB of the
host's memory when switching from memhog to page fault tests.

Any of the overhead visible with this patch set enabled seems due to page
faults caused by accessing the reported pages and the host zeroing the page
before giving it back to the guest. This overhead is much more visible when
using THP than with standard 4K pages. In addition page shuffling seemed to
increase the number of faults generated due to an increase in memory churn.
The overhead is reduced when using MADV_FREE as we can avoid the extra
zeroing of the pages when they are reintroduced to the host, as can be seen
when the RFC is applied with shuffling enabled.

The overall guest size is kept fairly small to only a few GB while the test
is running. If the host memory were oversubscribed this patch set should
result in a performance improvement as swapping memory in the host can be
avoided.

A brief history on the background of free page reporting can be found at:
https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/

Changes from v14:
https://lore.kernel.org/lkml/20191119214454.24996.66289.stgit@localhost.localdomain/
Renamed "unused page reporting" to "free page reporting"
  Updated code, kconfig, and patch descriptions
Split out patch for __free_isolated_page
  Renamed function to __putback_isolated_page
Rewrote core reporting functionality
  Added logic to reschedule worker in 2 seconds instead of run to completion
  Removed reported_pages statistics
  Removed REPORTING_REQUESTED bit used in zone flags
  Replaced page_reporting_dev_info refcount with state variable
  Removed scatterlist from page_reporting_dev_info
  Removed capacity from page reporting device
  Added dynamic scatterlist allocation/free at start/end of reporting process
  Updated __free_one_page so that reported pages are not always added to tail
  Added logic to handle error from report function
Updated virtio-balloon patch that adds support for page reporting
  Updated patch description to try and highlight differences in approaches
  Updated logic to reflect that we cannot limit the scatterlist from device
  Added logic to return error from report function
Moved documentation patch to end of patch set

Changes from v15:
https://lore.kernel.org/lkml/20191205161928.19548.41654.stgit@localhost.localdomain/
Rebased on linux-next-20191219
Split out patches for budget and moving head to last page processed
Updated budget code to reduce how much memory is reported per pass
Added logic to also rotate the list if we exit due to a page isolation failure
Added migratetype as argument in __putback_isolated_page

Changes from v16:
https://lore.kernel.org/lkml/20200103210509.29237.18426.stgit@localhost.localdomain/
Rebased on linux-next-20200122
  Updated patch 2 to account for removal of pr_info in __isolate_free_page
Updated patch title for patches 7, 8, and 9 to use prefix mm/page_reporting
No code changes other than conflict resolution for patch 2

---

Alexander Duyck (9):
      mm: Adjust shuffle code to allow for future coalescing
      mm: Use zone and order instead of free area in free_list manipulators
      mm: Add function __putback_isolated_page
      mm: Introduce Reported pages
      virtio-balloon: Pull page poisoning config out of free page hinting
      virtio-balloon: Add support for providing free page reports to host
      mm/page_reporting: Rotate reported pages to the tail of the list
      mm/page_reporting: Add budget limit on how many pages can be reported per pass
      mm/page_reporting: Add free page reporting documentation


 Documentation/vm/free_page_reporting.rst |   41 +++
 drivers/virtio/Kconfig                   |    1 
 drivers/virtio/virtio_balloon.c          |   87 +++++++
 include/linux/mmzone.h                   |   44 ----
 include/linux/page-flags.h               |   11 +
 include/linux/page_reporting.h           |   26 ++
 include/uapi/linux/virtio_balloon.h      |    1 
 mm/Kconfig                               |   11 +
 mm/Makefile                              |    1 
 mm/internal.h                            |    2 
 mm/page_alloc.c                          |  164 ++++++++++----
 mm/page_isolation.c                      |    6 
 mm/page_reporting.c                      |  364 ++++++++++++++++++++++++++++++
 mm/page_reporting.h                      |   54 ++++
 mm/shuffle.c                             |   12 -
 mm/shuffle.h                             |    6 
 16 files changed, 725 insertions(+), 106 deletions(-)
 create mode 100644 Documentation/vm/free_page_reporting.rst
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c
 create mode 100644 mm/page_reporting.h

--


* [PATCH v16.1 1/9] mm: Adjust shuffle code to allow for future coalescing
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 2/9] mm: Use zone and order instead of free area in free_list manipulators Alexander Duyck
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Move the head/tail adding logic out of the shuffle code and into the
__free_one_page function since ultimately that is where it is really
needed anyway. By doing this we should be able to reduce the overhead
and can consolidate all of the list addition bits in one spot.
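
The head-or-tail choice that the shuffle code now only returns, rather
than acts on, can be sketched in user space as a pool of random bits that
is consumed one bit per call. This is an illustration under assumptions:
refill_pool() is a hypothetical stand-in for get_random_u64(), and the
refill condition is assumed since it falls outside the diff context shown
below. Only the bit extraction mirrors the diff.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

static uint64_t rand_pool;	/* unused random bits */
static unsigned int rand_bits;	/* number of bits left in rand_pool */

/* Hypothetical stand-in for get_random_u64(); not cryptographic. */
static uint64_t refill_pool(void)
{
	uint64_t v = 0;

	/* rand() guarantees only 15 bits, so combine five draws. */
	for (int i = 0; i < 5; i++)
		v = (v << 15) | ((uint64_t)rand() & 0x7fff);
	return v;
}

/* Returns true if the page should go to the tail of the free list. */
static bool pick_tail(void)
{
	bool ret;

	/* Assumed refill condition: fetch 64 new bits once the pool is empty. */
	if (rand_bits == 0) {
		rand_bits = 64;
		rand_pool = refill_pool();
	}

	ret = rand_pool & 1;

	rand_bits--;
	rand_pool >>= 1;

	return ret;
}
```

Returning the decision instead of adding the page directly is what lets
the caller keep a single pair of list-add calls for both the shuffle and
the buddy-merge cases.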

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h |   12 --------
 mm/page_alloc.c        |   71 ++++++++++++++++++++++++++++--------------------
 mm/shuffle.c           |   12 ++++----
 mm/shuffle.h           |    6 ++++
 4 files changed, 54 insertions(+), 47 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..bdcd071ab67f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,18 +116,6 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
 	area->nr_free++;
 }
 
-#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
-/* Used to preserve page allocation order entropy */
-void add_to_free_area_random(struct page *page, struct free_area *area,
-		int migratetype);
-#else
-static inline void add_to_free_area_random(struct page *page,
-		struct free_area *area, int migratetype)
-{
-	add_to_free_area(page, area, migratetype);
-}
-#endif
-
 /* Used for pages which are on another list */
 static inline void move_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 621716a25639..2a5949833069 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -871,6 +871,36 @@ static inline struct capture_control *task_capc(struct zone *zone)
 #endif /* CONFIG_COMPACTION */
 
 /*
+ * If this is not the largest possible page, check if the buddy
+ * of the next-highest order is free. If it is, it's possible
+ * that pages are being freed that will coalesce soon. In case,
+ * that is happening, add the free page to the tail of the list
+ * so it's less likely to be used soon and more likely to be merged
+ * as a higher order page
+ */
+static inline bool
+buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
+		   struct page *page, unsigned int order)
+{
+	struct page *higher_page, *higher_buddy;
+	unsigned long combined_pfn;
+
+	if (order >= MAX_ORDER - 2)
+		return false;
+
+	if (!pfn_valid_within(buddy_pfn))
+		return false;
+
+	combined_pfn = buddy_pfn & pfn;
+	higher_page = page + (combined_pfn - pfn);
+	buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
+	higher_buddy = higher_page + (buddy_pfn - combined_pfn);
+
+	return pfn_valid_within(buddy_pfn) &&
+	       page_is_buddy(higher_page, higher_buddy, order + 1);
+}
+
+/*
  * Freeing function for a buddy system allocator.
  *
  * The concept of a buddy system is to maintain direct-mapped table
@@ -899,11 +929,13 @@ static inline void __free_one_page(struct page *page,
 		struct zone *zone, unsigned int order,
 		int migratetype)
 {
-	unsigned long combined_pfn;
+	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
-	struct page *buddy;
+	unsigned long combined_pfn;
+	struct free_area *area;
 	unsigned int max_order;
-	struct capture_control *capc = task_capc(zone);
+	struct page *buddy;
+	bool to_tail;
 
 	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
 
@@ -972,35 +1004,16 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_page_order(page, order);
 
-	/*
-	 * If this is not the largest possible page, check if the buddy
-	 * of the next-highest order is free. If it is, it's possible
-	 * that pages are being freed that will coalesce soon. In case,
-	 * that is happening, add the free page to the tail of the list
-	 * so it's less likely to be used soon and more likely to be merged
-	 * as a higher order page
-	 */
-	if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
-			&& !is_shuffle_order(order)) {
-		struct page *higher_page, *higher_buddy;
-		combined_pfn = buddy_pfn & pfn;
-		higher_page = page + (combined_pfn - pfn);
-		buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
-		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
-		if (pfn_valid_within(buddy_pfn) &&
-		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			add_to_free_area_tail(page, &zone->free_area[order],
-					      migratetype);
-			return;
-		}
-	}
-
+	area = &zone->free_area[order];
 	if (is_shuffle_order(order))
-		add_to_free_area_random(page, &zone->free_area[order],
-				migratetype);
+		to_tail = shuffle_pick_tail();
 	else
-		add_to_free_area(page, &zone->free_area[order], migratetype);
+		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
+	if (to_tail)
+		add_to_free_area_tail(page, area, migratetype);
+	else
+		add_to_free_area(page, area, migratetype);
 }
 
 /*
diff --git a/mm/shuffle.c b/mm/shuffle.c
index b3fe97fd6654..e65d57f39486 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -183,11 +183,11 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 		shuffle_zone(z);
 }
 
-void add_to_free_area_random(struct page *page, struct free_area *area,
-		int migratetype)
+bool shuffle_pick_tail(void)
 {
 	static u64 rand;
 	static u8 rand_bits;
+	bool ret;
 
 	/*
 	 * The lack of locking is deliberate. If 2 threads race to
@@ -198,10 +198,10 @@ void add_to_free_area_random(struct page *page, struct free_area *area,
 		rand = get_random_u64();
 	}
 
-	if (rand & 1)
-		add_to_free_area(page, area, migratetype);
-	else
-		add_to_free_area_tail(page, area, migratetype);
+	ret = rand & 1;
+
 	rand_bits--;
 	rand >>= 1;
+
+	return ret;
 }
diff --git a/mm/shuffle.h b/mm/shuffle.h
index 777a257a0d2f..4d79f03b6658 100644
--- a/mm/shuffle.h
+++ b/mm/shuffle.h
@@ -22,6 +22,7 @@ enum mm_shuffle_ctl {
 DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
 extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
 extern void __shuffle_free_memory(pg_data_t *pgdat);
+extern bool shuffle_pick_tail(void);
 static inline void shuffle_free_memory(pg_data_t *pgdat)
 {
 	if (!static_branch_unlikely(&page_alloc_shuffle_key))
@@ -44,6 +45,11 @@ static inline bool is_shuffle_order(int order)
 	return order >= SHUFFLE_ORDER;
 }
 #else
+static inline bool shuffle_pick_tail(void)
+{
+	return false;
+}
+
 static inline void shuffle_free_memory(pg_data_t *pgdat)
 {
 }



* [PATCH v16.1 2/9] mm: Use zone and order instead of free area in free_list manipulators
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 1/9] mm: Adjust shuffle code to allow for future coalescing Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 3/9] mm: Add function __putback_isolated_page Alexander Duyck
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

In order to enable the use of the zone from the list manipulator functions
I will need access to the zone pointer. As it turns out most of the
accessors were always just being directly passed &zone->free_area[order]
anyway so it would make sense to just fold that into the function itself
and pass the zone and order as arguments instead of the free area.

In order to be able to reference the zone we need to move the declaration
of the functions down so that we have the zone defined before we define the
list manipulation functions. Since the functions are only used in the file
mm/page_alloc.c we can just move them there to reduce noise in the header.

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h |   32 -----------------------
 mm/page_alloc.c        |   67 +++++++++++++++++++++++++++++++++++-------------
 2 files changed, 49 insertions(+), 50 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bdcd071ab67f..a32bd503b9fc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,29 +100,6 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
-/* Used for pages not on another list */
-static inline void add_to_free_area(struct page *page, struct free_area *area,
-			     int migratetype)
-{
-	list_add(&page->lru, &area->free_list[migratetype]);
-	area->nr_free++;
-}
-
-/* Used for pages not on another list */
-static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
-				  int migratetype)
-{
-	list_add_tail(&page->lru, &area->free_list[migratetype]);
-	area->nr_free++;
-}
-
-/* Used for pages which are on another list */
-static inline void move_to_free_area(struct page *page, struct free_area *area,
-			     int migratetype)
-{
-	list_move(&page->lru, &area->free_list[migratetype]);
-}
-
 static inline struct page *get_page_from_free_area(struct free_area *area,
 					    int migratetype)
 {
@@ -130,15 +107,6 @@ static inline struct page *get_page_from_free_area(struct free_area *area,
 					struct page, lru);
 }
 
-static inline void del_page_from_free_area(struct page *page,
-		struct free_area *area)
-{
-	list_del(&page->lru);
-	__ClearPageBuddy(page);
-	set_page_private(page, 0);
-	area->nr_free--;
-}
-
 static inline bool free_area_empty(struct free_area *area, int migratetype)
 {
 	return list_empty(&area->free_list[migratetype]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2a5949833069..b1cc0dab1c29 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -870,6 +870,44 @@ static inline struct capture_control *task_capc(struct zone *zone)
 }
 #endif /* CONFIG_COMPACTION */
 
+/* Used for pages not on another list */
+static inline void add_to_free_list(struct page *page, struct zone *zone,
+				    unsigned int order, int migratetype)
+{
+	struct free_area *area = &zone->free_area[order];
+
+	list_add(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
+					 unsigned int order, int migratetype)
+{
+	struct free_area *area = &zone->free_area[order];
+
+	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_list(struct page *page, struct zone *zone,
+				     unsigned int order, int migratetype)
+{
+	struct free_area *area = &zone->free_area[order];
+
+	list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline void del_page_from_free_list(struct page *page, struct zone *zone,
+					   unsigned int order)
+{
+	list_del(&page->lru);
+	__ClearPageBuddy(page);
+	set_page_private(page, 0);
+	zone->free_area[order].nr_free--;
+}
+
 /*
  * If this is not the largest possible page, check if the buddy
  * of the next-highest order is free. If it is, it's possible
@@ -932,7 +970,6 @@ static inline void __free_one_page(struct page *page,
 	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
 	unsigned long combined_pfn;
-	struct free_area *area;
 	unsigned int max_order;
 	struct page *buddy;
 	bool to_tail;
@@ -970,7 +1007,7 @@ static inline void __free_one_page(struct page *page,
 		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
-			del_page_from_free_area(buddy, &zone->free_area[order]);
+			del_page_from_free_list(buddy, zone, order);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1004,16 +1041,15 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_page_order(page, order);
 
-	area = &zone->free_area[order];
 	if (is_shuffle_order(order))
 		to_tail = shuffle_pick_tail();
 	else
 		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
 	if (to_tail)
-		add_to_free_area_tail(page, area, migratetype);
+		add_to_free_list_tail(page, zone, order, migratetype);
 	else
-		add_to_free_area(page, area, migratetype);
+		add_to_free_list(page, zone, order, migratetype);
 }
 
 /*
@@ -2027,13 +2063,11 @@ void __init init_cma_reserved_pageblock(struct page *page)
  * -- nyc
  */
 static inline void expand(struct zone *zone, struct page *page,
-	int low, int high, struct free_area *area,
-	int migratetype)
+	int low, int high, int migratetype)
 {
 	unsigned long size = 1 << high;
 
 	while (high > low) {
-		area--;
 		high--;
 		size >>= 1;
 		VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
@@ -2047,7 +2081,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		add_to_free_area(&page[size], area, migratetype);
+		add_to_free_list(&page[size], zone, high, migratetype);
 		set_page_order(&page[size], high);
 	}
 }
@@ -2205,8 +2239,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		page = get_page_from_free_area(area, migratetype);
 		if (!page)
 			continue;
-		del_page_from_free_area(page, area);
-		expand(zone, page, order, current_order, area, migratetype);
+		del_page_from_free_list(page, zone, current_order);
+		expand(zone, page, order, current_order, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
 	}
@@ -2280,7 +2314,7 @@ static int move_freepages(struct zone *zone,
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		order = page_order(page);
-		move_to_free_area(page, &zone->free_area[order], migratetype);
+		move_to_free_list(page, zone, order, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2396,7 +2430,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		unsigned int alloc_flags, int start_type, bool whole_block)
 {
 	unsigned int current_order = page_order(page);
-	struct free_area *area;
 	int free_pages, movable_pages, alike_pages;
 	int old_block_type;
 
@@ -2467,8 +2500,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	return;
 
 single_page:
-	area = &zone->free_area[current_order];
-	move_to_free_area(page, area, start_type);
+	move_to_free_list(page, zone, current_order, start_type);
 }
 
 /*
@@ -3139,7 +3171,6 @@ void split_page(struct page *page, unsigned int order)
 
 int __isolate_free_page(struct page *page, unsigned int order)
 {
-	struct free_area *area = &page_zone(page)->free_area[order];
 	unsigned long watermark;
 	struct zone *zone;
 	int mt;
@@ -3165,7 +3196,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 
 	/* Remove page from free list */
 
-	del_page_from_free_area(page, area);
+	del_page_from_free_list(page, zone, order);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -8725,7 +8756,7 @@ void zone_pcp_reset(struct zone *zone)
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
 		offlined_pages += 1 << order;
-		del_page_from_free_area(page, &zone->free_area[order]);
+		del_page_from_free_list(page, zone, order);
 		pfn += (1 << order);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);



* [PATCH v16.1 3/9] mm: Add function __putback_isolated_page
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 1/9] mm: Adjust shuffle code to allow for future coalescing Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 2/9] mm: Use zone and order instead of free area in free_list manipulators Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 4/9] mm: Introduce Reported pages Alexander Duyck
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

There are cases where we would benefit from avoiding having to go through
the allocation and free cycle to return an isolated page.

Examples for this might include page poisoning in which we isolate a page
and then put it back in the free list without ever having actually
allocated it.

This will enable us to also avoid notifiers for the future free page
reporting which will need to avoid retriggering page reporting when
returning pages that have been reported on.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/internal.h       |    2 ++
 mm/page_alloc.c     |   19 +++++++++++++++++++
 mm/page_isolation.c |    6 ++----
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..7b108222e5f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,6 +157,8 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 }
 
 extern int __isolate_free_page(struct page *page, unsigned int order);
+extern void __putback_isolated_page(struct page *page, unsigned int order,
+				    int mt);
 extern void memblock_free_pages(struct page *page, unsigned long pfn,
 					unsigned int order);
 extern void __free_pages_core(struct page *page, unsigned int order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1cc0dab1c29..f65e398eed89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3217,6 +3217,25 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	return 1UL << order;
 }
 
+/**
+ * __putback_isolated_page - Return a now-isolated page back where we got it
+ * @page: Page that was isolated
+ * @order: Order of the isolated page
+ *
+ * This function is meant to return a page pulled from the free lists via
+ * __isolate_free_page back to the free lists they were pulled from.
+ */
+void __putback_isolated_page(struct page *page, unsigned int order, int mt)
+{
+	struct zone *zone = page_zone(page);
+
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
+
+	/* Return isolated page to tail of freelist. */
+	__free_one_page(page, page_to_pfn(page), zone, order, mt);
+}
+
 /*
  * Update NUMA hit/miss statistics
  *
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index e70586523ca3..28d5ef1f85ef 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -113,13 +113,11 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 		__mod_zone_freepage_state(zone, nr_pages, migratetype);
 	}
 	set_pageblock_migratetype(page, migratetype);
+	if (isolated_page)
+		__putback_isolated_page(page, order, migratetype);
 	zone->nr_isolate_pageblock--;
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (isolated_page) {
-		post_alloc_hook(page, order, __GFP_MOVABLE);
-		__free_pages(page, order);
-	}
 }
 
 static inline struct page *



* [PATCH v16.1 4/9] mm: Introduce Reported pages
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (2 preceding siblings ...)
  2020-01-22 17:43 ` [PATCH v16.1 3/9] mm: Add function __putback_isolated_page Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 5/9] virtio-balloon: Pull page poisoning config out of free page hinting Alexander Duyck
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

In order to pave the way for free page reporting in virtualized
environments we will need a way to get pages out of the free lists and
identify those pages after they have been returned. To accomplish this,
this patch adds the concept of a Reported Buddy, which is essentially
meant to just be the Uptodate flag used in conjunction with the Buddy
page type.

To prevent the reported pages from leaking outside of the buddy lists I
added a check to clear the PageReported bit in the del_page_from_free_list
function. As a result any reported page that is split, merged, or
allocated will have the flag cleared prior to the PageBuddy value being
cleared.
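
The aliasing and the clear-on-removal rule can be modelled with plain bit
flags. This is a simplified user-space sketch, not the kernel's page flag
machinery: struct page_model and all helper names are hypothetical, and
Buddy is shown as an ordinary flag bit purely to keep the example short.

```c
#include <stdbool.h>

#define FLAG_UPTODATE	(1u << 0)
#define FLAG_REPORTED	FLAG_UPTODATE	/* alias, meaningful only with BUDDY */
#define FLAG_BUDDY	(1u << 1)	/* simplification: a plain flag bit */

struct page_model {
	unsigned int flags;
};

/* A page counts as reported only while it is still a buddy page. */
static bool page_reported(const struct page_model *p)
{
	return (p->flags & FLAG_BUDDY) && (p->flags & FLAG_REPORTED);
}

/* Mark a free page as reported; ignored for pages not in the buddy lists. */
static void mark_reported(struct page_model *p)
{
	if (p->flags & FLAG_BUDDY)
		p->flags |= FLAG_REPORTED;
}

/* Model of removing a page from the free list. */
static void del_from_free_list(struct page_model *p)
{
	/* Clear Reported first so it cannot leak out as a stale Uptodate. */
	p->flags &= ~FLAG_REPORTED;
	p->flags &= ~FLAG_BUDDY;
}
```

Since every split, merge, and allocation goes through the removal path,
clearing the bit there covers all three cases in one place.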

The process for reporting pages is fairly simple. Once we free a page that
meets the minimum order for page reporting we will schedule a worker thread
to start 2s or more in the future. That worker thread will begin working
from the lowest supported page reporting order up to MAX_ORDER - 1 pulling
unreported pages from the free list and storing them in the scatterlist.

When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages. To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in the
free list becomes the first entry in the list.

It will then call a reporting function providing information on how many
entries are in the scatterlist. Once the function completes it will return
the pages to the free area from which they were allocated and start over
pulling more pages from the free areas until there are no longer enough
pages to report on to keep the worker busy, or we have processed as many
pages as were contained in the free area when we started processing the
list.
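
The fill/report/drain loop described above can be sketched in userspace as follows. This is a simplified model under stated assumptions: pages are plain integers, the batch array stands in for the scatterlist, and toy_report_fn, toy_cycle, toy_flush and toy_count_report are hypothetical names. Zone locking, page isolation and the return of pages to the free area are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAPACITY 32	/* mirrors PAGE_REPORTING_CAPACITY */

/* Hypothetical report callback: receives a batch of page ids. */
typedef int (*toy_report_fn)(const int *batch, unsigned int nents, void *ctx);

/*
 * Fill the batch from the end towards the start, as the patch does with
 * its scatterlist, and report each time it becomes full. *offset carries
 * the fill level across calls so a partial batch survives between lists.
 */
static int toy_cycle(const int *pages, size_t npages, int *batch,
		     unsigned int *offset, toy_report_fn report, void *ctx)
{
	assert(*offset > 0 && *offset <= TOY_CAPACITY);

	for (size_t i = 0; i < npages; i++) {
		batch[--(*offset)] = pages[i];

		/* If the batch isn't full yet, grab more pages */
		if (*offset)
			continue;

		int err = report(batch, TOY_CAPACITY, ctx);

		/* Reset the offset since the full batch was handed over */
		*offset = TOY_CAPACITY;
		if (err)
			return err;
	}
	return 0;
}

/* Report the leftover partial batch, which starts at batch[*offset]. */
static int toy_flush(int *batch, unsigned int *offset,
		     toy_report_fn report, void *ctx)
{
	unsigned int leftover = TOY_CAPACITY - *offset;
	int err = 0;

	if (leftover)
		err = report(&batch[*offset], leftover, ctx);
	*offset = TOY_CAPACITY;	/* simplification: the patch does not reset here */
	return err;
}

/* Example callback: counts entries; ctx points to an unsigned int total. */
static int toy_count_report(const int *batch, unsigned int nents, void *ctx)
{
	(void)batch;
	*(unsigned int *)ctx += nents;
	return 0;
}
```

Filling from the tail means the leftover entries always form one contiguous run at the end of the array, so the final partial report needs only a pointer and a count.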

The worker thread will work in a round-robin fashion making its way
through each zone requesting reporting, and through each reportable free
list within that zone. Once all free areas within the zone have been
processed it will check to see if there have been any requests for
reporting while it was processing. If so it will reschedule the worker
thread to start up again in roughly 2s and exit.
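
The idle/requested/active handshake that decides whether the worker is rescheduled can be modelled with C11 atomics as below. This is a minimal sketch, assuming userspace atomics in place of the kernel's atomic_t helpers; the TOY_* states and toy_* functions are hypothetical names, and the boolean return values stand in for the calls to schedule_delayed_work().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { TOY_IDLE = 0, TOY_REQUESTED, TOY_ACTIVE };

/* Returns true when the caller should schedule the delayed worker. */
static bool toy_request(atomic_int *state)
{
	/* Cheap check first: nothing to do if a request is already pending */
	if (atomic_load(state) == TOY_REQUESTED)
		return false;

	/* Only the IDLE -> REQUESTED transition schedules new work */
	return atomic_exchange(state, TOY_REQUESTED) == TOY_IDLE;
}

/* Worker entry: flag the pass as running so later requests are visible. */
static void toy_worker_begin(atomic_int *state)
{
	atomic_store(state, TOY_ACTIVE);
}

/* Worker exit: returns true when the worker must be rescheduled. */
static bool toy_worker_end(atomic_int *state)
{
	int expected = TOY_ACTIVE;

	/* ACTIVE -> IDLE only succeeds if nobody asked during the pass */
	if (atomic_compare_exchange_strong(state, &expected, TOY_IDLE))
		return false;

	/* On failure, expected holds the state that was observed instead */
	assert(expected != TOY_ACTIVE);
	return expected == TOY_REQUESTED;
}
```

A request that arrives while the worker is active only flips the state; the worker itself notices the change at the end of its pass and re-arms the timer, so at most one delayed work item is ever pending.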

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/page-flags.h     |   11 +
 include/linux/page_reporting.h |   25 +++
 mm/Kconfig                     |   11 +
 mm/Makefile                    |    1 
 mm/page_alloc.c                |   17 ++
 mm/page_reporting.c            |  319 ++++++++++++++++++++++++++++++++++++++++
 mm/page_reporting.h            |   54 +++++++
 7 files changed, 434 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c
 create mode 100644 mm/page_reporting.h

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..49c2697046b9 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -163,6 +163,9 @@ enum pageflags {
 
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
+
+	/* Only valid for buddy pages. Used to track pages that are reported */
+	PG_reported = PG_uptodate,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -432,6 +435,14 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 #endif
 
 /*
+ * PageReported() is used to track reported free pages within the Buddy
+ * allocator. We can use the non-atomic version of the test and set
+ * operations as both should be shielded with the zone lock to prevent
+ * any possible races on the setting or clearing of the bit.
+ */
+__PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
+
+/*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
  * with the PAGE_MAPPING_ANON bit set to distinguish it.  See rmap.h.
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
new file mode 100644
index 000000000000..32355486f572
--- /dev/null
+++ b/include/linux/page_reporting.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_REPORTING_H
+#define _LINUX_PAGE_REPORTING_H
+
+#include <linux/mmzone.h>
+#include <linux/scatterlist.h>
+
+#define PAGE_REPORTING_CAPACITY		32
+
+struct page_reporting_dev_info {
+	/* function that alters pages to make them "reported" */
+	int (*report)(struct page_reporting_dev_info *prdev,
+		      struct scatterlist *sg, unsigned int nents);
+
+	/* work struct for processing reports */
+	struct delayed_work work;
+
+	/* Current state of page reporting */
+	atomic_t state;
+};
+
+/* Tear-down and bring-up for page reporting devices */
+void page_reporting_unregister(struct page_reporting_dev_info *prdev);
+int page_reporting_register(struct page_reporting_dev_info *prdev);
+#endif /*_LINUX_PAGE_REPORTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..d40a873402ff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -237,6 +237,17 @@ config COMPACTION
 	  linux-mm@kvack.org.
 
 #
+# support for free page reporting
+config PAGE_REPORTING
+	bool "Free page reporting"
+	def_bool n
+	help
+	  Free page reporting allows for the incremental acquisition of
+	  free pages from the buddy allocator for the purpose of reporting
+	  those pages to another entity, such as a hypervisor, so that the
+	  memory can be freed within the host for other uses.
+
+#
 # support for page migration
 #
 config MIGRATION
diff --git a/mm/Makefile b/mm/Makefile
index c9696f3ec840..7b5eec34d0e9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -118,3 +118,4 @@ obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
+obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f65e398eed89..cbf04ea9c817 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -74,6 +74,7 @@
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
+#include "page_reporting.h"
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
@@ -902,6 +903,10 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 					   unsigned int order)
 {
+	/* clear reported state and update reported page count */
+	if (page_reported(page))
+		__ClearPageReported(page);
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
@@ -965,7 +970,7 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 static inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool report)
 {
 	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
@@ -1050,6 +1055,10 @@ static inline void __free_one_page(struct page *page,
 		add_to_free_list_tail(page, zone, order, migratetype);
 	else
 		add_to_free_list(page, zone, order, migratetype);
+
+	/* Notify page reporting subsystem of freed page */
+	if (report)
+		page_reporting_notify_free(order);
 }
 
 /*
@@ -1366,7 +1375,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1382,7 +1391,7 @@ static void free_one_page(struct zone *zone,
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, true);
 	spin_unlock(&zone->lock);
 }
 
@@ -3233,7 +3242,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
 	lockdep_assert_held(&zone->lock);
 
 	/* Return isolated page to tail of freelist. */
-	__free_one_page(page, page_to_pfn(page), zone, order, mt);
+	__free_one_page(page, page_to_pfn(page), zone, order, mt, false);
 }
 
 /*
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
new file mode 100644
index 000000000000..1047c6872d4f
--- /dev/null
+++ b/mm/page_reporting.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/page_reporting.h>
+#include <linux/gfp.h>
+#include <linux/export.h>
+#include <linux/delay.h>
+#include <linux/scatterlist.h>
+
+#include "page_reporting.h"
+#include "internal.h"
+
+#define PAGE_REPORTING_DELAY	(2 * HZ)
+static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+
+enum {
+	PAGE_REPORTING_IDLE = 0,
+	PAGE_REPORTING_REQUESTED,
+	PAGE_REPORTING_ACTIVE
+};
+
+/* request page reporting */
+static void
+__page_reporting_request(struct page_reporting_dev_info *prdev)
+{
+	unsigned int state;
+
+	/* Check to see if we are in desired state */
+	state = atomic_read(&prdev->state);
+	if (state == PAGE_REPORTING_REQUESTED)
+		return;
+
+	/*
+	 *  If reporting is already active there is nothing we need to do.
+	 *  Test against 0 as that represents PAGE_REPORTING_IDLE.
+	 */
+	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
+	if (state != PAGE_REPORTING_IDLE)
+		return;
+
+	/*
+	 * Delay the start of work to allow a sizable queue to build. For
+	 * now we are limiting this to running no more than once every
+	 * couple of seconds.
+	 */
+	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+}
+
+/* notify prdev of free page reporting request */
+void __page_reporting_notify(void)
+{
+	struct page_reporting_dev_info *prdev;
+
+	/*
+	 * We use RCU to protect the pr_dev_info pointer. In almost all
+	 * cases this should be present, however in the unlikely case of
+	 * a shutdown this will be NULL and we should exit.
+	 */
+	rcu_read_lock();
+	prdev = rcu_dereference(pr_dev_info);
+	if (likely(prdev))
+		__page_reporting_request(prdev);
+
+	rcu_read_unlock();
+}
+
+static void
+page_reporting_drain(struct page_reporting_dev_info *prdev,
+		     struct scatterlist *sgl, unsigned int nents, bool reported)
+{
+	struct scatterlist *sg = sgl;
+
+	/*
+	 * Drain the now reported pages back into their respective
+	 * free lists/areas. We assume at least one page is populated.
+	 */
+	do {
+		struct page *page = sg_page(sg);
+		int mt = get_pageblock_migratetype(page);
+		unsigned int order = get_order(sg->length);
+
+		__putback_isolated_page(page, order, mt);
+
+		/* If the pages were not reported due to error skip flagging */
+		if (!reported)
+			continue;
+
+		/*
+		 * If page was not comingled with another page we can
+		 * consider the result to be "reported" since the page
+		 * hasn't been modified, otherwise we will need to
+		 * report on the new larger page when we make our way
+		 * up to that higher order.
+		 */
+		if (PageBuddy(page) && page_order(page) == order)
+			__SetPageReported(page);
+	} while ((sg = sg_next(sg)));
+
+	/* reinitialize scatterlist now that it is empty */
+	sg_init_table(sgl, nents);
+}
+
+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+		     unsigned int order, unsigned int mt,
+		     struct scatterlist *sgl, unsigned int *offset)
+{
+	struct free_area *area = &zone->free_area[order];
+	struct list_head *list = &area->free_list[mt];
+	unsigned int page_len = PAGE_SIZE << order;
+	struct page *page, *next;
+	int err = 0;
+
+	/*
+	 * Perform early check, if free area is empty there is
+	 * nothing to process so we can skip this free_list.
+	 */
+	if (list_empty(list))
+		return err;
+
+	spin_lock_irq(&zone->lock);
+
+	/* loop through free list adding unreported pages to sg list */
+	list_for_each_entry_safe(page, next, list, lru) {
+		/* We are going to skip over the reported pages. */
+		if (PageReported(page))
+			continue;
+
+		/* Attempt to pull page from list */
+		if (!__isolate_free_page(page, order))
+			break;
+
+		/* Add page to scatter list */
+		--(*offset);
+		sg_set_page(&sgl[*offset], page, page_len, 0);
+
+		/* If scatterlist isn't full grab more pages */
+		if (*offset)
+			continue;
+
+		/* release lock before waiting on report processing */
+		spin_unlock_irq(&zone->lock);
+
+		/* begin processing pages in local list */
+		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+
+		/* reset offset since the full list was reported */
+		*offset = PAGE_REPORTING_CAPACITY;
+
+		/* reacquire zone lock and resume processing */
+		spin_lock_irq(&zone->lock);
+
+		/* flush reported pages from the sg list */
+		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+
+		/*
+		 * Reset next to first entry, the old next isn't valid
+		 * since we dropped the lock to report the pages
+		 */
+		next = list_first_entry(list, struct page, lru);
+
+		/* exit on error */
+		if (err)
+			break;
+	}
+
+	spin_unlock_irq(&zone->lock);
+
+	return err;
+}
+
+static int
+page_reporting_process_zone(struct page_reporting_dev_info *prdev,
+			    struct scatterlist *sgl, struct zone *zone)
+{
+	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+	unsigned long watermark;
+	int err = 0;
+
+	/* Generate minimum watermark to be able to guarantee progress */
+	watermark = low_wmark_pages(zone) +
+		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
+
+	/*
+	 * Cancel request if insufficient free memory or if we failed
+	 * to allocate page reporting statistics for the zone.
+	 */
+	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		return err;
+
+	/* Process each free list starting from lowest order/mt */
+	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+			/* We do not pull pages from the isolate free list */
+			if (is_migrate_isolate(mt))
+				continue;
+
+			err = page_reporting_cycle(prdev, zone, order, mt,
+						   sgl, &offset);
+			if (err)
+				return err;
+		}
+	}
+
+	/* report the leftover pages before going idle */
+	leftover = PAGE_REPORTING_CAPACITY - offset;
+	if (leftover) {
+		sgl = &sgl[offset];
+		err = prdev->report(prdev, sgl, leftover);
+
+		/* flush any remaining pages out from the last report */
+		spin_lock_irq(&zone->lock);
+		page_reporting_drain(prdev, sgl, leftover, !err);
+		spin_unlock_irq(&zone->lock);
+	}
+
+	return err;
+}
+
+static void page_reporting_process(struct work_struct *work)
+{
+	struct delayed_work *d_work = to_delayed_work(work);
+	struct page_reporting_dev_info *prdev =
+		container_of(d_work, struct page_reporting_dev_info, work);
+	int err = 0, state = PAGE_REPORTING_ACTIVE;
+	struct scatterlist *sgl;
+	struct zone *zone;
+
+	/*
+	 * Change the state to "Active" so that we can track whether
+	 * anyone requests page reporting after we complete our pass. If
+	 * the state is not altered by the end of the pass we will switch
+	 * to idle and quit scheduling reporting runs.
+	 */
+	atomic_set(&prdev->state, state);
+
+	/* allocate scatterlist to store pages being reported on */
+	sgl = kmalloc_array(PAGE_REPORTING_CAPACITY, sizeof(*sgl), GFP_KERNEL);
+	if (!sgl)
+		goto err_out;
+
+	sg_init_table(sgl, PAGE_REPORTING_CAPACITY);
+
+	for_each_zone(zone) {
+		err = page_reporting_process_zone(prdev, sgl, zone);
+		if (err)
+			break;
+	}
+
+	kfree(sgl);
+err_out:
+	/*
+	 * If the state has reverted back to requested then there may be
+	 * additional pages to be processed. We will defer for 2s to allow
+	 * more pages to accumulate.
+	 */
+	state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE);
+	if (state == PAGE_REPORTING_REQUESTED)
+		schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+}
+
+static DEFINE_MUTEX(page_reporting_mutex);
+DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);
+
+int page_reporting_register(struct page_reporting_dev_info *prdev)
+{
+	int err = 0;
+
+	mutex_lock(&page_reporting_mutex);
+
+	/* nothing to do if already in use */
+	if (rcu_access_pointer(pr_dev_info)) {
+		err = -EBUSY;
+		goto err_out;
+	}
+
+	/* initialize state and work structures */
+	atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
+	INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
+
+	/* Begin initial flush of zones */
+	__page_reporting_request(prdev);
+
+	/* Assign device to allow notifications */
+	rcu_assign_pointer(pr_dev_info, prdev);
+
+	/* enable page reporting notification */
+	if (!static_key_enabled(&page_reporting_enabled)) {
+		static_branch_enable(&page_reporting_enabled);
+		pr_info("Free page reporting enabled\n");
+	}
+err_out:
+	mutex_unlock(&page_reporting_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(page_reporting_register);
+
+void page_reporting_unregister(struct page_reporting_dev_info *prdev)
+{
+	mutex_lock(&page_reporting_mutex);
+
+	if (rcu_access_pointer(pr_dev_info) == prdev) {
+		/* Disable page reporting notification */
+		RCU_INIT_POINTER(pr_dev_info, NULL);
+		synchronize_rcu();
+
+		/* Flush any existing work, and lock it out */
+		cancel_delayed_work_sync(&prdev->work);
+	}
+
+	mutex_unlock(&page_reporting_mutex);
+}
+EXPORT_SYMBOL_GPL(page_reporting_unregister);
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
new file mode 100644
index 000000000000..aa6d37f4dc22
--- /dev/null
+++ b/mm/page_reporting.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _MM_PAGE_REPORTING_H
+#define _MM_PAGE_REPORTING_H
+
+#include <linux/mmzone.h>
+#include <linux/pageblock-flags.h>
+#include <linux/page-isolation.h>
+#include <linux/jump_label.h>
+#include <linux/slab.h>
+#include <asm/pgtable.h>
+#include <linux/scatterlist.h>
+
+#define PAGE_REPORTING_MIN_ORDER	pageblock_order
+
+#ifdef CONFIG_PAGE_REPORTING
+DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
+void __page_reporting_notify(void);
+
+static inline bool page_reported(struct page *page)
+{
+	return static_branch_unlikely(&page_reporting_enabled) &&
+	       PageReported(page);
+}
+
+/**
+ * page_reporting_notify_free - Free page notification to start page processing
+ *
+ * This function is meant to act as a screener for __page_reporting_notify
+ * which will determine if a given zone has crossed over the high-water mark
+ * that will justify us beginning page treatment. If we have crossed that
+ * threshold then it will start the process of pulling some pages and
+ * placing them in the batch list for treatment.
+ */
+static inline void page_reporting_notify_free(unsigned int order)
+{
+	/* Called from hot path in __free_one_page() */
+	if (!static_branch_unlikely(&page_reporting_enabled))
+		return;
+
+	/* Determine if we have crossed reporting threshold */
+	if (order < PAGE_REPORTING_MIN_ORDER)
+		return;
+
+	/* This will add a few cycles, but should be called infrequently */
+	__page_reporting_notify();
+}
+#else /* CONFIG_PAGE_REPORTING */
+#define page_reported(_page)	false
+
+static inline void page_reporting_notify_free(unsigned int order)
+{
+}
+#endif /* CONFIG_PAGE_REPORTING */
+#endif /*_MM_PAGE_REPORTING_H */



* [PATCH v16.1 5/9] virtio-balloon: Pull page poisoning config out of free page hinting
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (3 preceding siblings ...)
  2020-01-22 17:43 ` [PATCH v16.1 4/9] mm: Introduce Reported pages Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host Alexander Duyck
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Currently the page poisoning setting is not enabled unless free page
hinting is enabled. However we will also need the page poisoning tracking
logic for free page reporting. As such pull it out and make it a separate
bit of config in the probe function.

In addition we need to add support for the more recent init_on_free feature
which expects a behavior similar to page poisoning in that we expect the
page to be pre-zeroed.
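
The two decisions this patch makes can be sketched as plain functions: which value is written to the poison_val config field, and whether the PAGE_POISON feature bit is kept. This is an illustrative model only. TOY_PAGE_POISON is assumed to be 0xaa (the kernel defines PAGE_POISON in include/linux/poison.h and the value depends on configuration), and the toy_* helpers and their boolean parameters are hypothetical stand-ins for want_init_on_free(), page_poisoning_enabled() and CONFIG_PAGE_POISONING_NO_SANITY.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed poison byte for this example; see include/linux/poison.h */
#define TOY_PAGE_POISON 0xaa

/* Value the guest would advertise in the balloon config. */
static uint32_t toy_poison_val(bool init_on_free)
{
	/* Start with 0, which represents general init (pre-zeroed pages) */
	uint32_t val = 0;

	/* Without init_on_free, every byte carries the poison pattern */
	if (!init_on_free)
		memset(&val, TOY_PAGE_POISON, sizeof(val));

	return val;
}

/*
 * Mirror of the virtballoon_validate() condition: the feature bit is
 * cleared when init_on_free is off and poisoning is either disabled or
 * built without sanity checking.
 */
static bool toy_keep_poison_feature(bool init_on_free, bool poisoning_enabled,
				    bool no_sanity)
{
	if (init_on_free)
		return true;
	return poisoning_enabled && !no_sanity;
}
```

With init_on_free the host is told to expect zeroed pages, so the advertised value is 0 regardless of the page poisoning configuration.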

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/virtio/virtio_balloon.c |   23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8e400ece9273..40bb7693e3de 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -862,7 +862,6 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
-	__u32 poison_val;
 	int err;
 
 	if (!vdev->config->get) {
@@ -929,11 +928,20 @@ static int virtballoon_probe(struct virtio_device *vdev)
 						  VIRTIO_BALLOON_CMD_ID_STOP);
 		spin_lock_init(&vb->free_page_list_lock);
 		INIT_LIST_HEAD(&vb->free_page_list);
-		if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+	}
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+		/* Start with poison val of 0 representing general init */
+		__u32 poison_val = 0;
+
+		/*
+		 * Let the hypervisor know that we are expecting a
+		 * specific value to be written back in balloon pages.
+		 */
+		if (!want_init_on_free())
 			memset(&poison_val, PAGE_POISON, sizeof(poison_val));
-			virtio_cwrite(vb->vdev, struct virtio_balloon_config,
-				      poison_val, &poison_val);
-		}
+
+		virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+			      poison_val, &poison_val);
 	}
 	/*
 	 * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a
@@ -1034,7 +1042,10 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
-	if (!page_poisoning_enabled())
+	/* Tell the host whether we care about poisoned pages. */
+	if (!want_init_on_free() &&
+	    (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
+	     !page_poisoning_enabled()))
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
 
 	__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);



* [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (4 preceding siblings ...)
  2020-01-22 17:43 ` [PATCH v16.1 5/9] virtio-balloon: Pull page poisoning config out of free page hinting Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-02-11 11:03   ` David Hildenbrand
  2020-01-22 17:43 ` [PATCH v16.1 7/9] mm/page_reporting: Rotate reported pages to the tail of the list Alexander Duyck
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Add support for the page reporting feature provided by virtio-balloon.
Reporting differs from the regular balloon functionality in that it is
much less durable than a standard memory balloon. Instead of creating a
list of pages that cannot be accessed the pages are only inaccessible
while they are being indicated to the virtio interface. Once the
interface has acknowledged them they are placed back into their respective
free lists and are once again accessible by the guest system.

Unlike a standard balloon we don't inflate and deflate the pages. Instead
we perform the reporting, and once the reporting is completed it is
assumed that the page has been dropped from the guest and will be faulted
back in the next time the page is accessed.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/virtio/Kconfig              |    1 +
 drivers/virtio/virtio_balloon.c     |   64 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_balloon.h |    1 +
 3 files changed, 66 insertions(+)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 078615cf2afc..4b2dd8259ff5 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -58,6 +58,7 @@ config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
 	select MEMORY_BALLOON
+	select PAGE_REPORTING
 	---help---
 	 This driver supports increasing and decreasing the amount
 	 of memory within a KVM guest.
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 40bb7693e3de..a07b9e18a292 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,7 @@
 #include <linux/mount.h>
 #include <linux/magic.h>
 #include <linux/pseudo_fs.h>
+#include <linux/page_reporting.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -47,6 +48,7 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_DEFLATE,
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_REPORTING,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -114,6 +116,10 @@ struct virtio_balloon {
 
 	/* To register a shrinker to shrink memory upon memory pressure */
 	struct shrinker shrinker;
+
+	/* Free page reporting device */
+	struct virtqueue *reporting_vq;
+	struct page_reporting_dev_info pr_dev_info;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -153,6 +159,33 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 }
 
+int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_info,
+				   struct scatterlist *sg, unsigned int nents)
+{
+	struct virtio_balloon *vb =
+		container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
+	struct virtqueue *vq = vb->reporting_vq;
+	unsigned int unused, err;
+
+	/* We should always be able to add these buffers to an empty queue. */
+	err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
+
+	/*
+	 * In the extremely unlikely case that something has occurred and we
+	 * are able to trigger an error we will simply display a warning
+	 * and exit without actually processing the pages.
+	 */
+	if (WARN_ON_ONCE(err))
+		return err;
+
+	virtqueue_kick(vq);
+
+	/* When host has read buffer, this completes via balloon_ack */
+	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+
+	return 0;
+}
+
 static void set_page_pfns(struct virtio_balloon *vb,
 			  __virtio32 pfns[], struct page *page)
 {
@@ -479,6 +512,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
 	callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -490,6 +524,11 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
+		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
+	}
+
 	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 					 vqs, callbacks, names, NULL, NULL);
 	if (err)
@@ -522,6 +561,9 @@ static int init_vqs(struct virtio_balloon *vb)
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
+
 	return 0;
 }
 
@@ -952,12 +994,31 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		if (err)
 			goto out_del_balloon_wq;
 	}
+
+	vb->pr_dev_info.report = virtballoon_free_page_report;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+		unsigned int capacity;
+
+		capacity = virtqueue_get_vring_size(vb->reporting_vq);
+		if (capacity < PAGE_REPORTING_CAPACITY) {
+			err = -ENOSPC;
+			goto out_unregister_shrinker;
+		}
+
+		err = page_reporting_register(&vb->pr_dev_info);
+		if (err)
+			goto out_unregister_shrinker;
+	}
+
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
 		virtballoon_changed(vdev);
 	return 0;
 
+out_unregister_shrinker:
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+		virtio_balloon_unregister_shrinker(vb);
 out_del_balloon_wq:
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		destroy_workqueue(vb->balloon_wq);
@@ -986,6 +1047,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb = vdev->priv;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+		page_reporting_unregister(&vb->pr_dev_info);
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		virtio_balloon_unregister_shrinker(vb);
 	spin_lock_irq(&vb->stop_update_lock);
@@ -1058,6 +1121,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
+	VIRTIO_BALLOON_F_REPORTING,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..19974392d324 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12



* [PATCH v16.1 7/9] mm/page_reporting: Rotate reported pages to the tail of the list
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (5 preceding siblings ...)
  2020-01-22 17:43 ` [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-01-22 17:43 ` [PATCH v16.1 8/9] mm/page_reporting: Add budget limit on how many pages can be reported per pass Alexander Duyck
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Rather than walking over the same pages again and again to get to the pages
that have yet to be reported, we can simply rotate the list so that when we
have a full list of reported pages the head of the list is pointing to the
next non-reported page. Doing this should save a significant amount of time
when processing each free list.
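
To make the rotation concrete, the sketch below reimplements it on a small circular doubly linked list in userspace. It assumes the same approach as the kernel's list_rotate_to_front(), which moves the list head so that it sits immediately before the chosen entry; struct toy_node and the toy_* helpers are hypothetical names for this example and are not the <linux/list.h> API.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list node; the head is a node with no payload. */
struct toy_node {
	struct toy_node *prev, *next;
	int id;
};

static void toy_list_init(struct toy_node *head)
{
	head->prev = head->next = head;
}

static void toy_list_add_tail(struct toy_node *node, struct toy_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Make entry the first element by moving the head to just before it.
 * Entries that preceded it end up at the tail, in their original order.
 */
static void toy_rotate_to_front(struct toy_node *entry, struct toy_node *head)
{
	assert(entry != NULL && head != NULL);

	/* Nothing to do if entry is the head itself or already first */
	if (entry == head || head->next == entry)
		return;

	/* Unlink the head from its current position in the ring */
	head->prev->next = head->next;
	head->next->prev = head->prev;

	/* Reinsert the head immediately before entry */
	head->prev = entry->prev;
	head->next = entry;
	entry->prev->next = head;
	entry->prev = head;
}
```

The rotation costs a constant number of pointer updates no matter how many already-reported pages are skipped, which is why the next pass can resume at the first unreported page without rewalking the list.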

This doesn't gain us much in the standard case as all of the non-reported
pages should be near the top of the list already. However in the case of
page shuffling this results in a noticeable improvement. Below are the
will-it-scale page_fault1 w/ THP numbers for 16 tasks with and without
this patch.

Without:
tasks   processes       processes_idle  threads         threads_idle
16      8093776.25      0.17            5393242.00      38.20

With:
tasks   processes       processes_idle  threads         threads_idle
16      8283274.75      0.17            5594261.00      38.15

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/page_reporting.c |   30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 1047c6872d4f..6885e74c2367 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -131,17 +131,27 @@ void __page_reporting_notify(void)
 		if (PageReported(page))
 			continue;
 
-		/* Attempt to pull page from list */
-		if (!__isolate_free_page(page, order))
-			break;
+		/* Attempt to pull page from list and place in scatterlist */
+		if (*offset) {
+			if (!__isolate_free_page(page, order)) {
+				next = page;
+				break;
+			}
 
-		/* Add page to scatter list */
-		--(*offset);
-		sg_set_page(&sgl[*offset], page, page_len, 0);
+			/* Add page to scatter list */
+			--(*offset);
+			sg_set_page(&sgl[*offset], page, page_len, 0);
 
-		/* If scatterlist isn't full grab more pages */
-		if (*offset)
 			continue;
+		}
+
+		/*
+		 * Make the first non-processed page in the free list
+		 * the new head of the free list before we release the
+		 * zone lock.
+		 */
+		if (&page->lru != list && !list_is_first(&page->lru, list))
+			list_rotate_to_front(&page->lru, list);
 
 		/* release lock before waiting on report processing */
 		spin_unlock_irq(&zone->lock);
@@ -169,6 +179,10 @@ void __page_reporting_notify(void)
 			break;
 	}
 
+	/* Rotate any leftover pages to the head of the freelist */
+	if (&next->lru != list && !list_is_first(&next->lru, list))
+		list_rotate_to_front(&next->lru, list);
+
 	spin_unlock_irq(&zone->lock);
 
 	return err;


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v16.1 8/9] mm/page_reporting: Add budget limit on how many pages can be reported per pass
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (6 preceding siblings ...)
  2020-01-22 17:43 ` [PATCH v16.1 7/9] mm/page_reporting: Rotate reported pages to the tail of the list Alexander Duyck
@ 2020-01-22 17:43 ` Alexander Duyck
  2020-01-22 17:44 ` [PATCH v16.1 9/9] mm/page_reporting: Add free page reporting documentation Alexander Duyck
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:43 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

In order to keep ourselves from reporting pages that are just going to be
reused again in the case of heavy churn we can put a limit on how many
total pages we will process per pass. Doing this will allow the worker
thread to go into idle much more quickly so that we avoid competing with
other threads that might be allocating or freeing pages.

The logic added here will limit the worker thread to no more than one
sixteenth of the total free pages in a given area per list. Once that limit
is reached it will update the state so that at the end of the pass we will
reschedule the worker to try again in 2 seconds when the memory churn has
hopefully settled down.
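
To make the arithmetic concrete, below is a small userspace sketch of the
budget calculation. It mirrors the DIV_ROUND_UP() expression from the patch;
the helper names (report_budget, max_report_calls, max_pages_per_pass) are
invented for the example, and the "+ 1" reflects my reading of the loop,
where the budget is only treated as exhausted once it drops below zero.

```c
#include <assert.h>

#define PAGE_REPORTING_CAPACITY 32	/* value used in the patch */

/* Userspace equivalent of DIV_ROUND_UP(nr_free, PAGE_REPORTING_CAPACITY * 16). */
static long report_budget(unsigned long nr_free)
{
	unsigned long per_call = PAGE_REPORTING_CAPACITY * 16;

	/* The patch relies on the capacity being a power of 2. */
	assert((PAGE_REPORTING_CAPACITY & (PAGE_REPORTING_CAPACITY - 1)) == 0);

	return (long)((nr_free + per_call - 1) / per_call);
}

/*
 * The loop bails out once the budget is negative, and the budget is
 * decremented after each report call, so a starting budget of B allows
 * up to B + 1 calls.
 */
static long max_report_calls(unsigned long nr_free)
{
	return report_budget(nr_free) + 1;
}

/* Upper bound on pages handed to the device for one list in one pass. */
static long max_pages_per_pass(unsigned long nr_free)
{
	return max_report_calls(nr_free) * PAGE_REPORTING_CAPACITY;
}
```

For a free list of 8192 pages this gives a budget of 16, so at most 17
report calls and 544 pages per pass: one sixteenth of the list (512 pages)
plus one extra scatterlist. If each pass covers roughly a sixteenth of the
list and passes are 2 seconds apart, an idle system needs about 16 passes,
or roughly 32 seconds, to report everything.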

Again this optimization doesn't show much of a benefit in the standard case
as the memory churn is minimal. However with page allocator shuffling
enabled the gain is quite noticeable. Below are the results with a THP
enabled version of the will-it-scale page_fault1 test showing the
improvement in iterations for 16 processes or threads.

Without:
tasks   processes       processes_idle  threads         threads_idle
16      8283274.75      0.17            5594261.00      38.15

With:
tasks   processes       processes_idle  threads         threads_idle
16      8767010.50      0.21            5791312.75      36.98

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/page_reporting.h |    1 +
 mm/page_reporting.c            |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 32355486f572..3b99e0ec24f2 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -5,6 +5,7 @@
 #include <linux/mmzone.h>
 #include <linux/scatterlist.h>
 
+/* This value should always be a power of 2, see page_reporting_cycle() */
 #define PAGE_REPORTING_CAPACITY		32
 
 struct page_reporting_dev_info {
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 6885e74c2367..3bbd471cfc81 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -114,6 +114,7 @@ void __page_reporting_notify(void)
 	struct list_head *list = &area->free_list[mt];
 	unsigned int page_len = PAGE_SIZE << order;
 	struct page *page, *next;
+	long budget;
 	int err = 0;
 
 	/*
@@ -125,12 +126,39 @@ void __page_reporting_notify(void)
 
 	spin_lock_irq(&zone->lock);
 
+	/*
+	 * Limit how many calls we will be making to the page reporting
+	 * device for this list. By doing this we avoid processing any
+	 * given list for too long.
+	 *
+	 * The current value used allows us enough calls to process over a
+	 * sixteenth of the current list plus one additional call to handle
+	 * any pages that may have already been present from the previous
+	 * list processed. This should result in us reporting all pages on
+	 * an idle system in about 30 seconds.
+	 *
+	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
+	 * should always be a power of 2.
+	 */
+	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
+
 	/* loop through free list adding unreported pages to sg list */
 	list_for_each_entry_safe(page, next, list, lru) {
 		/* We are going to skip over the reported pages. */
 		if (PageReported(page))
 			continue;
 
+		/*
+		 * If we fully consumed our budget then update our
+		 * state to indicate that we are requesting additional
+		 * processing and exit this list.
+		 */
+		if (budget < 0) {
+			atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
+			next = page;
+			break;
+		}
+
 		/* Attempt to pull page from list and place in scatterlist */
 		if (*offset) {
 			if (!__isolate_free_page(page, order)) {
@@ -146,7 +174,7 @@ void __page_reporting_notify(void)
 		}
 
 		/*
-		 * Make the first non-processed page in the free list
+		 * Make the first non-reported page in the free list
 		 * the new head of the free list before we release the
 		 * zone lock.
 		 */
@@ -162,6 +190,9 @@ void __page_reporting_notify(void)
 		/* reset offset since the full list was reported */
 		*offset = PAGE_REPORTING_CAPACITY;
 
+		/* update budget to reflect call to report function */
+		budget--;
+
 		/* reacquire zone lock and resume processing */
 		spin_lock_irq(&zone->lock);
 


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v16.1 9/9] mm/page_reporting: Add free page reporting documentation
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (7 preceding siblings ...)
  2020-01-22 17:43 ` [PATCH v16.1 8/9] mm/page_reporting: Add budget limit on how many pages can be reported per pass Alexander Duyck
@ 2020-01-22 17:44 ` Alexander Duyck
  2020-01-23 10:20 ` [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Graf
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-22 17:44 UTC (permalink / raw)
  To: kvm, mst, linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Add documentation for free page reporting. Currently the only consumer is
virtio-balloon, however it is possible that other drivers might make use of
this so it is best to add a bit of documentation explaining at a high level
how to use the API.
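
The register/report/unregister flow that the new document describes can be
mocked up in userspace roughly as follows. This is only a sketch: the types
and functions here (struct mock_sg, struct mock_dev_info, mock_register,
mock_unregister, count_report) are hypothetical stand-ins, not the kernel
API, and returning -EBUSY when a device is already registered is an
assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGE_REPORTING_CAPACITY 32	/* value from page_reporting.h */

/* Hypothetical stand-in for one scatterlist entry. */
struct mock_sg {
	void *page;
	unsigned int length;
};

/* Hypothetical stand-in for page_reporting_dev_info: just the callback. */
struct mock_dev_info {
	int (*report)(struct mock_dev_info *dev, struct mock_sg *sgl,
		      unsigned int nents);
};

/* Only one reporting device may be registered at a time. */
static struct mock_dev_info *registered;

static int mock_register(struct mock_dev_info *dev)
{
	if (registered)
		return -EBUSY;	/* assumed error code for this sketch */
	registered = dev;
	return 0;
}

static void mock_unregister(struct mock_dev_info *dev)
{
	if (registered == dev)
		registered = NULL;
}

/* Example driver callback: it must cope with a full scatterlist per call. */
static unsigned int pages_seen;

static int count_report(struct mock_dev_info *dev, struct mock_sg *sgl,
			unsigned int nents)
{
	(void)dev;
	(void)sgl;
	assert(nents <= PAGE_REPORTING_CAPACITY);
	pages_seen += nents;
	return 0;
}
```

A driver would fill in the report callback, register, receive batches of up
to PAGE_REPORTING_CAPACITY entries, and unregister before it is removed; a
second driver can only register once the first has unregistered.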

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 Documentation/vm/free_page_reporting.rst |   41 ++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 Documentation/vm/free_page_reporting.rst

diff --git a/Documentation/vm/free_page_reporting.rst b/Documentation/vm/free_page_reporting.rst
new file mode 100644
index 000000000000..33f54a450a4a
--- /dev/null
+++ b/Documentation/vm/free_page_reporting.rst
@@ -0,0 +1,41 @@
+.. _free_page_reporting:
+
+=====================
+Free Page Reporting
+=====================
+
+Free page reporting is an API by which a device can register to receive
+lists of pages that are currently unused by the system. This is useful in
+the case of virtualization where a guest is then able to use this data to
+notify the hypervisor that it is no longer using certain pages in memory.
+
+For the driver, typically a balloon driver, to make use of this functionality
+it will allocate and initialize a page_reporting_dev_info structure. The
+field within the structure it will populate is the "report" function
+pointer used to process the scatterlist. It must also guarantee that it can
+handle at least PAGE_REPORTING_CAPACITY worth of scatterlist entries per
+call to the function. A call to page_reporting_register will register the
+page reporting interface with the reporting framework assuming no other
+page reporting devices are already registered.
+
+Once registered the page reporting API will begin reporting batches of
+pages to the driver. The API will start reporting pages 2 seconds after
+the interface is registered and will continue to do so 2 seconds after any
+page of a sufficiently high order is freed.
+
+Pages reported will be stored in the scatterlist passed to the reporting
+function with the final entry having the end bit set in entry nent - 1.
+While pages are being processed by the report function they will not be
+accessible to the allocator. Once the report function has been completed
+the pages will be returned to the free area from which they were obtained.
+
+Prior to removing a driver that is making use of free page reporting it
+is necessary to call page_reporting_unregister to have the
+page_reporting_dev_info structure that is currently in use by free page
+reporting removed. Doing this will prevent further reports from being
+issued via the interface. If another driver or the same driver is
+registered it is possible for it to resume where the previous driver had
+left off in terms of reporting free pages.
+
+Alexander Duyck, Dec 04, 2019
+


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (8 preceding siblings ...)
  2020-01-22 17:44 ` [PATCH v16.1 9/9] mm/page_reporting: Add free page reporting documentation Alexander Duyck
@ 2020-01-23 10:20 ` Alexander Graf
  2020-01-23 14:05   ` David Hildenbrand
  2020-01-23 16:26   ` Alexander Duyck
       [not found] ` <20200124132352.12824-1-hdanton@sina.com>
  2020-02-03 22:05 ` Alexander Duyck
  11 siblings, 2 replies; 39+ messages in thread
From: Alexander Graf @ 2020-01-23 10:20 UTC (permalink / raw)
  To: Alexander Duyck, kvm, mst, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck, osalvador, Paterson-Jones,
	Roland, hannes, hare

Hi Alex,

On 22.01.20 18:43, Alexander Duyck wrote:
> This series provides an asynchronous means of reporting free guest pages
> to a hypervisor so that the memory associated with those pages can be
> dropped and reused by other processes and/or guests on the host. Using
> this it is possible to avoid unnecessary I/O to disk and greatly improve
> performance in the case of memory overcommit on the host.
> 
> When enabled we will be performing a scan of free memory every 2 seconds
> while pages of sufficiently high order are being freed. In each pass at
> least one sixteenth of each free list will be reported. By doing this we
> avoid racing against other threads that may be causing a high amount of
> memory churn.
> 
> The lowest page order currently scanned when reporting pages is
> pageblock_order so that this feature will not interfere with the use of
> Transparent Huge Pages in the case of virtualization.
> 
> Currently this is only in use by virtio-balloon however there is the hope
> that at some point in the future other hypervisors might be able to make
> use of it. In the virtio-balloon/QEMU implementation the hypervisor is
> currently using MADV_DONTNEED to indicate to the host kernel that the page
> is currently free. It will be zeroed and faulted back into the guest the
> next time the page is accessed.
> 
> To track if a page is reported or not the Uptodate flag was repurposed and
> used as a Reported flag for Buddy pages. We walk though the free list
> isolating pages and adding them to the scatterlist until we either
> encounter the end of the list, processed as many pages as were listed in
> nr_free prior to us starting, or have filled the scatterlist with pages to
> be reported. If we fill the scatterlist before we reach the end of the
> list we rotate the list so that the first unreported page we encounter is
> moved to the head of the list as that is where we will resume after we
> have freed the reported pages back into the tail of the list.
> 
> Below are the results from various benchmarks. I primarily focused on two
> tests. The first is the will-it-scale/page_fault2 test, and the other is
> a modified version of will-it-scale/page_fault1 that was enabled to use
> THP. I did this as it allows for better visibility into different parts
> of the memory subsystem. The guest is running with 32G for RAM on one
> node of a E5-2630 v3. The host has had some features such as CPU turbo
> disabled in the BIOS.
> 
> Test                   page_fault1 (THP)    page_fault2
> Name            tasks  Process Iter  STDEV  Process Iter  STDEV
> Baseline            1    1012402.50  0.14%     361855.25  0.81%
>                     16    8827457.25  0.09%    3282347.00  0.34%
> 
> Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
>                     16    8784741.75  0.39%    3240669.25  0.48%
> 
> Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
>                     16    8756219.00  0.24%    3226608.75  0.97%
> 
> Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
>   page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
> 
> Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
>   shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
> 
> The results above are for a baseline with a linux-next-20191219 kernel,
> that kernel with this patch set applied but page reporting disabled in
> virtio-balloon, the patches applied and page reporting fully enabled, the
> patches enabled with page shuffling enabled, and the patches applied with
> page shuffling enabled and an RFC patch that makes used of MADV_FREE in
> QEMU. These results include the deviation seen between the average value
> reported here versus the high and/or low value. I observed that during the
> test memory usage for the first three tests never dropped whereas with the
> patches fully enabled the VM would drop to using only a few GB of the
> host's memory when switching from memhog to page fault tests.
> 
> Any of the overhead visible with this patch set enabled seems due to page
> faults caused by accessing the reported pages and the host zeroing the page
> before giving it back to the guest. This overhead is much more visible when
> using THP than with standard 4K pages. In addition page shuffling seemed to
> increase the amount of faults generated due to an increase in memory churn.
> The overhead is reduced when using MADV_FREE as we can avoid the extra
> zeroing of the pages when they are reintroduced to the host, as can be seen
> when the RFC is applied with shuffling enabled.
> 
> The overall guest size is kept fairly small to only a few GB while the test
> is running. If the host memory were oversubscribed this patch set should
> result in a performance improvement as swapping memory in the host can be
> avoided.


I really like the approach overall. Voluntarily propagating free memory 
from a guest to the host has been a sore point ever since KVM was 
around. This solution looks like a very elegant way to do so.

The big piece I'm missing is the page cache. Linux will by default try 
to keep the free list as small as it can in favor of page cache, so most 
of the benefit of this patch set will be void in real world scenarios.

Traditionally, this was solved by creating pressure from the host
through virtio-balloon: exactly the piece that this patch set does away
with. I never liked "ballooning", because the host has very limited
visibility into the actual memory utilization of its guests. So the
decision on how much memory is actually needed at a given point in time
should ideally stay with the guest.

What would keep us from applying the page hinting approach to inactive, 
clean page cache pages? With writeback in place as well, we would slowly 
propagate pages from

   dirty -> clean -> clean, inactive -> free -> host owned

which gives a guest a natural path to give up "not important" memory.

The big problem I see is that what I really want from a user's point of
view is a tunable that says "Automatically free clean page cache pages
that were not accessed in the last X minutes". Otherwise we run the risk
of evicting page cache pages that are still occasionally in use.

I have a hard time grasping the mm code well enough to understand how
hard that would be to implement though :).


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 10:20 ` [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Graf
@ 2020-01-23 14:05   ` David Hildenbrand
  2020-01-23 14:52     ` Alexander Graf
  2020-01-23 16:26   ` Alexander Duyck
  1 sibling, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-01-23 14:05 UTC (permalink / raw)
  To: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck, osalvador, Paterson-Jones, Roland, hannes,
	hare

On 23.01.20 11:20, Alexander Graf wrote:
> Hi Alex,
> 
> On 22.01.20 18:43, Alexander Duyck wrote:
>> This series provides an asynchronous means of reporting free guest pages
>> to a hypervisor so that the memory associated with those pages can be
>> dropped and reused by other processes and/or guests on the host. Using
>> this it is possible to avoid unnecessary I/O to disk and greatly improve
>> performance in the case of memory overcommit on the host.
>>
>> When enabled we will be performing a scan of free memory every 2 seconds
>> while pages of sufficiently high order are being freed. In each pass at
>> least one sixteenth of each free list will be reported. By doing this we
>> avoid racing against other threads that may be causing a high amount of
>> memory churn.
>>
>> The lowest page order currently scanned when reporting pages is
>> pageblock_order so that this feature will not interfere with the use of
>> Transparent Huge Pages in the case of virtualization.
>>
>> Currently this is only in use by virtio-balloon however there is the hope
>> that at some point in the future other hypervisors might be able to make
>> use of it. In the virtio-balloon/QEMU implementation the hypervisor is
>> currently using MADV_DONTNEED to indicate to the host kernel that the page
>> is currently free. It will be zeroed and faulted back into the guest the
>> next time the page is accessed.
>>
>> To track if a page is reported or not the Uptodate flag was repurposed and
>> used as a Reported flag for Buddy pages. We walk though the free list
>> isolating pages and adding them to the scatterlist until we either
>> encounter the end of the list, processed as many pages as were listed in
>> nr_free prior to us starting, or have filled the scatterlist with pages to
>> be reported. If we fill the scatterlist before we reach the end of the
>> list we rotate the list so that the first unreported page we encounter is
>> moved to the head of the list as that is where we will resume after we
>> have freed the reported pages back into the tail of the list.
>>
>> Below are the results from various benchmarks. I primarily focused on two
>> tests. The first is the will-it-scale/page_fault2 test, and the other is
>> a modified version of will-it-scale/page_fault1 that was enabled to use
>> THP. I did this as it allows for better visibility into different parts
>> of the memory subsystem. The guest is running with 32G for RAM on one
>> node of a E5-2630 v3. The host has had some features such as CPU turbo
>> disabled in the BIOS.
>>
>> Test                   page_fault1 (THP)    page_fault2
>> Name            tasks  Process Iter  STDEV  Process Iter  STDEV
>> Baseline            1    1012402.50  0.14%     361855.25  0.81%
>>                     16    8827457.25  0.09%    3282347.00  0.34%
>>
>> Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
>>                     16    8784741.75  0.39%    3240669.25  0.48%
>>
>> Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
>>                     16    8756219.00  0.24%    3226608.75  0.97%
>>
>> Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
>>   page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
>>
>> Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
>>   shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
>>
>> The results above are for a baseline with a linux-next-20191219 kernel,
>> that kernel with this patch set applied but page reporting disabled in
>> virtio-balloon, the patches applied and page reporting fully enabled, the
>> patches enabled with page shuffling enabled, and the patches applied with
>> page shuffling enabled and an RFC patch that makes used of MADV_FREE in
>> QEMU. These results include the deviation seen between the average value
>> reported here versus the high and/or low value. I observed that during the
>> test memory usage for the first three tests never dropped whereas with the
>> patches fully enabled the VM would drop to using only a few GB of the
>> host's memory when switching from memhog to page fault tests.
>>
>> Any of the overhead visible with this patch set enabled seems due to page
>> faults caused by accessing the reported pages and the host zeroing the page
>> before giving it back to the guest. This overhead is much more visible when
>> using THP than with standard 4K pages. In addition page shuffling seemed to
>> increase the amount of faults generated due to an increase in memory churn.
>> The overhead is reduced when using MADV_FREE as we can avoid the extra
>> zeroing of the pages when they are reintroduced to the host, as can be seen
>> when the RFC is applied with shuffling enabled.
>>
>> The overall guest size is kept fairly small to only a few GB while the test
>> is running. If the host memory were oversubscribed this patch set should
>> result in a performance improvement as swapping memory in the host can be
>> avoided.
> 
> 
> I really like the approach overall. Voluntarily propagating free memory 
> from a guest to the host has been a sore point ever since KVM was 
> around. This solution looks like a very elegant way to do so.
> 
> The big piece I'm missing is the page cache. Linux will by default try 
> to keep the free list as small as it can in favor of page cache, so most 
> of the benefit of this patch set will be void in real world scenarios.

One approach is to move (parts of) the page cache from the guest to the
hypervisor - e.g., using emulated NVDIMM or virtio-pmem.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 14:05   ` David Hildenbrand
@ 2020-01-23 14:52     ` Alexander Graf
  2020-01-24 13:25       ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Alexander Graf @ 2020-01-23 14:52 UTC (permalink / raw)
  To: David Hildenbrand, Alexander Duyck, kvm, mst, linux-kernel,
	willy, mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck, osalvador, Paterson-Jones, Roland, hannes,
	hare



On 23.01.20 15:05, David Hildenbrand wrote:
> On 23.01.20 11:20, Alexander Graf wrote:
>> Hi Alex,
>>
>> On 22.01.20 18:43, Alexander Duyck wrote:
>>> This series provides an asynchronous means of reporting free guest pages
>>> to a hypervisor so that the memory associated with those pages can be
>>> dropped and reused by other processes and/or guests on the host. Using
>>> this it is possible to avoid unnecessary I/O to disk and greatly improve
>>> performance in the case of memory overcommit on the host.
>>>
>>> When enabled we will be performing a scan of free memory every 2 seconds
>>> while pages of sufficiently high order are being freed. In each pass at
>>> least one sixteenth of each free list will be reported. By doing this we
>>> avoid racing against other threads that may be causing a high amount of
>>> memory churn.
>>>
>>> The lowest page order currently scanned when reporting pages is
>>> pageblock_order so that this feature will not interfere with the use of
>>> Transparent Huge Pages in the case of virtualization.
>>>
>>> Currently this is only in use by virtio-balloon however there is the hope
>>> that at some point in the future other hypervisors might be able to make
>>> use of it. In the virtio-balloon/QEMU implementation the hypervisor is
>>> currently using MADV_DONTNEED to indicate to the host kernel that the page
>>> is currently free. It will be zeroed and faulted back into the guest the
>>> next time the page is accessed.
>>>
>>> To track if a page is reported or not the Uptodate flag was repurposed and
>>> used as a Reported flag for Buddy pages. We walk though the free list
>>> isolating pages and adding them to the scatterlist until we either
>>> encounter the end of the list, processed as many pages as were listed in
>>> nr_free prior to us starting, or have filled the scatterlist with pages to
>>> be reported. If we fill the scatterlist before we reach the end of the
>>> list we rotate the list so that the first unreported page we encounter is
>>> moved to the head of the list as that is where we will resume after we
>>> have freed the reported pages back into the tail of the list.
>>>
>>> Below are the results from various benchmarks. I primarily focused on two
>>> tests. The first is the will-it-scale/page_fault2 test, and the other is
>>> a modified version of will-it-scale/page_fault1 that was enabled to use
>>> THP. I did this as it allows for better visibility into different parts
>>> of the memory subsystem. The guest is running with 32G for RAM on one
>>> node of a E5-2630 v3. The host has had some features such as CPU turbo
>>> disabled in the BIOS.
>>>
>>> Test                   page_fault1 (THP)    page_fault2
>>> Name            tasks  Process Iter  STDEV  Process Iter  STDEV
>>> Baseline            1    1012402.50  0.14%     361855.25  0.81%
>>>                      16    8827457.25  0.09%    3282347.00  0.34%
>>>
>>> Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
>>>                      16    8784741.75  0.39%    3240669.25  0.48%
>>>
>>> Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
>>>                      16    8756219.00  0.24%    3226608.75  0.97%
>>>
>>> Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
>>>    page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
>>>
>>> Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
>>>    shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
>>>
>>> The results above are for a baseline with a linux-next-20191219 kernel,
>>> that kernel with this patch set applied but page reporting disabled in
>>> virtio-balloon, the patches applied and page reporting fully enabled, the
>>> patches enabled with page shuffling enabled, and the patches applied with
>>> page shuffling enabled and an RFC patch that makes used of MADV_FREE in
>>> QEMU. These results include the deviation seen between the average value
>>> reported here versus the high and/or low value. I observed that during the
>>> test memory usage for the first three tests never dropped whereas with the
>>> patches fully enabled the VM would drop to using only a few GB of the
>>> host's memory when switching from memhog to page fault tests.
>>>
>>> Any of the overhead visible with this patch set enabled seems due to page
>>> faults caused by accessing the reported pages and the host zeroing the page
>>> before giving it back to the guest. This overhead is much more visible when
>>> using THP than with standard 4K pages. In addition page shuffling seemed to
>>> increase the amount of faults generated due to an increase in memory churn.
>>> The overhead is reduced when using MADV_FREE as we can avoid the extra
>>> zeroing of the pages when they are reintroduced to the host, as can be seen
>>> when the RFC is applied with shuffling enabled.
>>>
>>> The overall guest size is kept fairly small to only a few GB while the test
>>> is running. If the host memory were oversubscribed this patch set should
>>> result in a performance improvement as swapping memory in the host can be
>>> avoided.
>>
>>
>> I really like the approach overall. Voluntarily propagating free memory
>> from a guest to the host has been a sore point ever since KVM was
>> around. This solution looks like a very elegant way to do so.
>>
>> The big piece I'm missing is the page cache. Linux will by default try
>> to keep the free list as small as it can in favor of page cache, so most
>> of the benefit of this patch set will be void in real world scenarios.
> 
> One approach is to move (parts of) the page cache from the guest to the
> hypervisor - e.g., using emulated NVDIMM or virtio-pmem.

Whether you can do that depends heavily on your virtualization 
environment. On a host with single tenant VMs, that's definitely 
feasible. In a Kubernetes environment, it might also be feasible.

But when you have VMs that assume that the host is interfering with them 
as little as possible, it becomes harder:

How do you ensure fairness across different VMs' page cache that is 
munged into a single big host one?

Do you even have host page cache or are you using SR-IOV / mdev for 
storage for performance reasons?


The puzzle is still incomplete, even with NVDIMM exposure to the guest 
as an option unfortunately :).


Alex






^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 10:20 ` [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Graf
  2020-01-23 14:05   ` David Hildenbrand
@ 2020-01-23 16:26   ` Alexander Duyck
  2020-01-23 16:54     ` Alexander Graf
                       ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-23 16:26 UTC (permalink / raw)
  To: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, osalvador, Paterson-Jones, Roland, hannes, hare

On Thu, 2020-01-23 at 11:20 +0100, Alexander Graf wrote:
> Hi Alex,
> 
> On 22.01.20 18:43, Alexander Duyck wrote:
> > This series provides an asynchronous means of reporting free guest pages
> > to a hypervisor so that the memory associated with those pages can be
> > dropped and reused by other processes and/or guests on the host. Using
> > this it is possible to avoid unnecessary I/O to disk and greatly improve
> > performance in the case of memory overcommit on the host.
> > 
> > When enabled we will be performing a scan of free memory every 2 seconds
> > while pages of sufficiently high order are being freed. In each pass at
> > least one sixteenth of each free list will be reported. By doing this we
> > avoid racing against other threads that may be causing a high amount of
> > memory churn.
> > 
> > The lowest page order currently scanned when reporting pages is
> > pageblock_order so that this feature will not interfere with the use of
> > Transparent Huge Pages in the case of virtualization.
> > 
> > Currently this is only in use by virtio-balloon however there is the hope
> > that at some point in the future other hypervisors might be able to make
> > use of it. In the virtio-balloon/QEMU implementation the hypervisor is
> > currently using MADV_DONTNEED to indicate to the host kernel that the page
> > is currently free. It will be zeroed and faulted back into the guest the
> > next time the page is accessed.
> > 
> > To track if a page is reported or not the Uptodate flag was repurposed and
> > used as a Reported flag for Buddy pages. We walk through the free list
> > isolating pages and adding them to the scatterlist until we either
> > encounter the end of the list, processed as many pages as were listed in
> > nr_free prior to us starting, or have filled the scatterlist with pages to
> > be reported. If we fill the scatterlist before we reach the end of the
> > list we rotate the list so that the first unreported page we encounter is
> > moved to the head of the list as that is where we will resume after we
> > have freed the reported pages back into the tail of the list.
> > 
> > Below are the results from various benchmarks. I primarily focused on two
> > tests. The first is the will-it-scale/page_fault2 test, and the other is
> > a modified version of will-it-scale/page_fault1 that was enabled to use
> > THP. I did this as it allows for better visibility into different parts
> > of the memory subsystem. The guest is running with 32G for RAM on one
> > node of a E5-2630 v3. The host has had some features such as CPU turbo
> > disabled in the BIOS.
> > 
> > Test                   page_fault1 (THP)    page_fault2
> > Name            tasks  Process Iter  STDEV  Process Iter  STDEV
> > Baseline            1    1012402.50  0.14%     361855.25  0.81%
> >                     16    8827457.25  0.09%    3282347.00  0.34%
> > 
> > Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
> >                     16    8784741.75  0.39%    3240669.25  0.48%
> > 
> > Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
> >                     16    8756219.00  0.24%    3226608.75  0.97%
> > 
> > Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
> >   page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
> > 
> > Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
> >   shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
> > 
> > The results above are for a baseline with a linux-next-20191219 kernel,
> > that kernel with this patch set applied but page reporting disabled in
> > virtio-balloon, the patches applied and page reporting fully enabled, the
> > patches enabled with page shuffling enabled, and the patches applied with
> > page shuffling enabled and an RFC patch that makes use of MADV_FREE in
> > QEMU. These results include the deviation seen between the average value
> > reported here versus the high and/or low value. I observed that during the
> > test memory usage for the first three tests never dropped whereas with the
> > patches fully enabled the VM would drop to using only a few GB of the
> > host's memory when switching from memhog to page fault tests.
> > 
> > Any of the overhead visible with this patch set enabled seems due to page
> > faults caused by accessing the reported pages and the host zeroing the page
> > before giving it back to the guest. This overhead is much more visible when
> > using THP than with standard 4K pages. In addition page shuffling seemed to
> > increase the amount of faults generated due to an increase in memory churn.
> > The overhead is reduced when using MADV_FREE as we can avoid the extra
> > zeroing of the pages when they are reintroduced to the host, as can be seen
> > when the RFC is applied with shuffling enabled.
> > 
> > The overall guest size is kept fairly small to only a few GB while the test
> > is running. If the host memory were oversubscribed this patch set should
> > result in a performance improvement as swapping memory in the host can be
> > avoided.
> 
> I really like the approach overall. Voluntarily propagating free memory 
> from a guest to the host has been a sore point ever since KVM was 
> around. This solution looks like a very elegant way to do so.
> 
> The big piece I'm missing is the page cache. Linux will by default try 
> to keep the free list as small as it can in favor of page cache, so most 
> of the benefit of this patch set will be void in real world scenarios.

Agreed. This is the next piece of this I plan to work on once this is
accepted. For now the quick and dirty approach is to essentially make use
of the /proc/sys/vm/drop_caches interface in the guest, either by putting
it in a cron job somewhere or by running it after memory-intensive
workloads.
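
To make that workaround concrete, here is a minimal sketch, assuming
Python 3 is available in the guest and the script runs as root. The helper
name drop_caches and its path parameter are purely illustrative (the
parameter only exists so the helper can be pointed at a scratch file when
testing); the values 1/2/3 are the ones the kernel documents for this file.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(mode=1, path=DROP_CACHES):
    """Ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    `path` is an illustrative parameter so tests can use a scratch file.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    # Dirty pages cannot be dropped, so flush them to disk first.
    os.sync()
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

Called as drop_caches(1) from a cron job or at the end of a batch job, this
would release clean page cache so that the freed pages can later be picked
up by page reporting. It is a blunt hammer, since it does not distinguish
between hot and cold cache.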

> Traditionally, this was solved by creating pressure from the host 
> through virtio-balloon: Exactly the piece that this patch set gets away 
> with. I never liked "ballooning", because the host has very limited 
> visibility into the actual memory utility of its guests. So leaving the 
> decision on how much memory is actually needed at a given point in time 
> should ideally stay with the guest.
> 
> What would keep us from applying the page hinting approach to inactive, 
> clean page cache pages? With writeback in place as well, we would slowly 
> propagate pages from
> 
>    dirty -> clean -> clean, inactive -> free -> host owned
> 
> which gives a guest a natural path to give up "not important" memory.

I considered something similar. Basically one thought I had was to
essentially look at putting together some sort of epoch. When the host is
under memory pressure it would need to somehow notify the guest and then
the guest would start moving the epoch forward so that we start evicting
pages out of the page cache when the host is under memory pressure.
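
Very roughly, the epoch idea could be modeled like the toy below. This is
only a sketch of the concept and not anything in the patch set: the class
name, max_age, and the notion of one advance() call per host pressure
notification are all assumptions made up for illustration.

```python
class EpochCache:
    """Toy page cache that ages pages by epoch rather than wall-clock time."""

    def __init__(self, max_age=2):
        self.epoch = 0
        self.max_age = max_age   # epochs a page may go untouched (assumed knob)
        self.last_used = {}      # page id -> epoch of last access

    def touch(self, page):
        """Record an access to `page` in the current epoch."""
        self.last_used[page] = self.epoch

    def advance(self):
        """Move the epoch forward once, e.g. on a host pressure notification.

        Returns the sorted list of pages evicted because they have not been
        touched for more than max_age epochs.
        """
        self.epoch += 1
        stale = sorted(p for p, e in self.last_used.items()
                       if self.epoch - e > self.max_age)
        for p in stale:
            del self.last_used[p]
        return stale
```

With no notifications from the host the epoch never moves and nothing is
evicted, which is the point: the guest only sheds page cache while the host
actually needs the memory.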

> The big problem I see is that what I really want from a user's point of 
> view is a tuneable that says "Automatically free clean page cache pages 
> that were not accessed in the last X minutes". Otherwise we may run into 
> the risk of evicting some times in use page cache pages.
> 
> I have a hard time grasping the mm code to understand how hard that 
> would be to implement that though :).
> 
> 
> Alex

Yeah, I am not exactly an expert on this either as I have only been
working in the MM tree for about a year now.

I have submitted this as a topic for LSF/MM summit[1] and I am hoping to
get some feedback on the best way to apply proactive memory pressure as
one of the subtopics if it is selected.

Thanks.

- Alex

[1]: https://lore.kernel.org/linux-mm/4b8671d16573307da09afc56030c2a5f5a9c45bf.camel@linux.intel.com/



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 16:26   ` Alexander Duyck
@ 2020-01-23 16:54     ` Alexander Graf
  2020-01-23 18:33       ` Alexander Duyck
  2020-01-23 17:20     ` Dave Hansen
  2020-01-23 19:17     ` Johannes Weiner
  2 siblings, 1 reply; 39+ messages in thread
From: Alexander Graf @ 2020-01-23 16:54 UTC (permalink / raw)
  To: Alexander Duyck, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, osalvador, Paterson-Jones, Roland, hannes, hare



On 23.01.20 17:26, Alexander Duyck wrote:
> On Thu, 2020-01-23 at 11:20 +0100, Alexander Graf wrote:
>> Hi Alex,
>>
>> On 22.01.20 18:43, Alexander Duyck wrote:
[...]
>>> The overall guest size is kept fairly small to only a few GB while the test
>>> is running. If the host memory were oversubscribed this patch set should
>>> result in a performance improvement as swapping memory in the host can be
>>> avoided.
>>
>> I really like the approach overall. Voluntarily propagating free memory
>> from a guest to the host has been a sore point ever since KVM was
>> around. This solution looks like a very elegant way to do so.
>>
>> The big piece I'm missing is the page cache. Linux will by default try
>> to keep the free list as small as it can in favor of page cache, so most
>> of the benefit of this patch set will be void in real world scenarios.
> 
> Agreed. This is the next piece of this I plan to work on once this is
> accepted. For now the quick and dirty approach is to essentially make use
> of the /proc/sys/vm/drop_caches interface in the guest by either putting
> it in a cronjob somewhere or to have it after memory intensive workloads.
> 
>> Traditionally, this was solved by creating pressure from the host
>> through virtio-balloon: Exactly the piece that this patch set gets away
>> with. I never liked "ballooning", because the host has very limited
>> visibility into the actual memory utility of its guests. So leaving the
>> decision on how much memory is actually needed at a given point in time
>> should ideally stay with the guest.
>>
>> What would keep us from applying the page hinting approach to inactive,
>> clean page cache pages? With writeback in place as well, we would slowly
>> propagate pages from
>>
>>     dirty -> clean -> clean, inactive -> free -> host owned
>>
>> which gives a guest a natural path to give up "not important" memory.
> 
> I considered something similar. Basically one thought I had was to
> essentially look at putting together some sort of epoch. When the host is
> under memory pressure it would need to somehow notify the guest and then
> the guest would start moving the epoch forward so that we start evicting
> pages out of the page cache when the host is under memory pressure.

I think we want to consider an interface in which the host actively asks 
guests to purge pages to be on the same line as swapping: The last line 
of defense.

In the normal mode of operation, you still want to shrink down
voluntarily, so that everyone cooperatively tries to make room for new
guests you could potentially run on the same host.

If you start to apply pressure to guests to find out if they might have
some pages to spare, we're almost back to the old style ballooning approach.

Btw, have you ever looked at CMM2 [1]? With that, the host can 
essentially just "steal" pages from the guest when it needs any, without 
the need to execute the guest meanwhile. That means inside the host 
swapping path, CMM2 can just evict guest page cache pages as easily as 
we evict host page cache pages. To me, that's even more attractive in 
the swap / emergency case than an interface which requires the guest to 
proactively execute while we are in a low mem situation.

>> The big problem I see is that what I really want from a user's point of
>> view is a tuneable that says "Automatically free clean page cache pages
>> that were not accessed in the last X minutes". Otherwise we may run into
>> the risk of evicting some times in use page cache pages.
>>
>> I have a hard time grasping the mm code to understand how hard that
>> would be to implement that though :).
>>
>>
>> Alex
> 
> Yeah, I am not exactly an expert on this either as I have only been
> working in the MM tree for about a year now.
> 
> I have submitted this as a topic for LSF/MM summit[1] and I am hoping to
> get some feedback on the best way to apply proactive memory pressure as
> one of the subtopics if it is selected.

That's a great idea! Hannes just mentioned LSF/MM as a good forum to 
discuss this at last night, I'm glad to see you already picked up on it :).


Alex

[1] https://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 16:26   ` Alexander Duyck
  2020-01-23 16:54     ` Alexander Graf
@ 2020-01-23 17:20     ` Dave Hansen
  2020-01-23 19:23       ` Konrad Rzeszutek Wilk
  2020-01-23 19:17     ` Johannes Weiner
  2 siblings, 1 reply; 39+ messages in thread
From: Dave Hansen @ 2020-01-23 17:20 UTC (permalink / raw)
  To: Alexander Duyck, Alexander Graf, Alexander Duyck, kvm, mst,
	linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka,
	Van De Ven, Arjan
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	osalvador, Paterson-Jones, Roland, hannes, hare, Boeuf,
	Sebastien

On 1/23/20 8:26 AM, Alexander Duyck wrote:
>> The big piece I'm missing is the page cache. Linux will by default try 
>> to keep the free list as small as it can in favor of page cache, so most 
>> of the benefit of this patch set will be void in real world scenarios.
> Agreed. This is the next piece of this I plan to work on once this is
> accepted. For now the quick and dirty approach is to essentially make use
> of the /proc/sys/vm/drop_caches interface in the guest by either putting
> it in a cronjob somewhere or to have it after memory intensive workloads.

There was an implementation in "Clear Linux" that used this sysctl:

> https://github.com/Conan-Kudo/omv-kernel-rc/blob/master/0154-sysctl-vm-Fine-grained-cache-shrinking.patch

(I can't find it in the Clear repos at the moment, must not be used
currently).  But the idea was to have a little daemon in the host that
periodically applied some artificial pressure with this sysctl.  This
sysctl is a smaller hammer than /proc/sys/vm/drop_caches and lets you
drop small amounts of cache.

The right way to do it is probably to do real, generic reclaim instead
of drop_caches.

This isn't conceptually *that* far away from the "proactive reclaim"
that other folks have proposed:

	https://lwn.net/Articles/787611/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 16:54     ` Alexander Graf
@ 2020-01-23 18:33       ` Alexander Duyck
  2020-01-23 18:47         ` Graf (AWS), Alexander
  0 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2020-01-23 18:33 UTC (permalink / raw)
  To: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, david, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, osalvador, Paterson-Jones, Roland, hannes, hare

On Thu, 2020-01-23 at 17:54 +0100, Alexander Graf wrote:
> 
> On 23.01.20 17:26, Alexander Duyck wrote:
> > On Thu, 2020-01-23 at 11:20 +0100, Alexander Graf wrote:
> > > Hi Alex,
> > > 
> > > On 22.01.20 18:43, Alexander Duyck wrote:
> [...]
> > > > The overall guest size is kept fairly small to only a few GB while the test
> > > > is running. If the host memory were oversubscribed this patch set should
> > > > result in a performance improvement as swapping memory in the host can be
> > > > avoided.
> > > 
> > > I really like the approach overall. Voluntarily propagating free memory
> > > from a guest to the host has been a sore point ever since KVM was
> > > around. This solution looks like a very elegant way to do so.
> > > 
> > > The big piece I'm missing is the page cache. Linux will by default try
> > > to keep the free list as small as it can in favor of page cache, so most
> > > of the benefit of this patch set will be void in real world scenarios.
> > 
> > Agreed. This is the next piece of this I plan to work on once this is
> > accepted. For now the quick and dirty approach is to essentially make use
> > of the /proc/sys/vm/drop_caches interface in the guest by either putting
> > it in a cronjob somewhere or to have it after memory intensive workloads.
> > 
> > > Traditionally, this was solved by creating pressure from the host
> > > through virtio-balloon: Exactly the piece that this patch set gets away
> > > with. I never liked "ballooning", because the host has very limited
> > > visibility into the actual memory utility of its guests. So leaving the
> > > decision on how much memory is actually needed at a given point in time
> > > should ideally stay with the guest.
> > > 
> > > What would keep us from applying the page hinting approach to inactive,
> > > clean page cache pages? With writeback in place as well, we would slowly
> > > propagate pages from
> > > 
> > >     dirty -> clean -> clean, inactive -> free -> host owned
> > > 
> > > which gives a guest a natural path to give up "not important" memory.
> > 
> > I considered something similar. Basically one thought I had was to
> > essentially look at putting together some sort of epoch. When the host is
> > under memory pressure it would need to somehow notify the guest and then
> > the guest would start moving the epoch forward so that we start evicting
> > pages out of the page cache when the host is under memory pressure.
> 
> I think we want to consider an interface in which the host actively asks 
> guests to purge pages to be on the same line as swapping: The last line 
> of defense.

I suppose. The only reason I was thinking that we may want to look at
doing something like that was to avoid putting pressure on the guest when
the host doesn't need us to.

> In the normal mode of operation, you still want to shrink down
> voluntarily, so that everyone cooperatively tries to make room for new
> guests you could potentially run on the same host.
> 
> If you start to apply pressure to guests to find out if they might have
> some pages to spare, we're almost back to the old style ballooning approach.

That's true. In addition we avoid possible issues with us trying to flush
out a bunch of memory from multiple guests at once, since they would be
proactively freeing the memory.

I'm thinking the inactive state could be something similar to MADV_FREE in
terms of behavior.  If it sits in the queue for long enough we decide
nobody is using it anymore so it is freed, but if it is accessed it is
cheap for us to just put it back without much in the way of overhead.
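
As a rough model of that behavior, something like the toy queue below is
what I have in mind. It is a sketch under stated assumptions only: the
class, its method names, and the single timeout value are hypothetical and
do not correspond to any existing kernel interface.

```python
from collections import OrderedDict


class InactiveQueue:
    """Toy model of an MADV_FREE-like inactive state for page cache pages."""

    def __init__(self, timeout):
        self.timeout = timeout       # how long a page may sit unused
        self.queued = OrderedDict()  # page -> time it became inactive

    def deactivate(self, page, now):
        """Put a page at the tail of the inactive queue."""
        self.queued[page] = now
        self.queued.move_to_end(page)

    def access(self, page):
        """Cheaply rescue a page on access; True if it was still queued."""
        return self.queued.pop(page, None) is not None

    def reap(self, now):
        """Free pages that sat in the queue for at least `timeout`."""
        freed = []
        while self.queued:
            page, since = next(iter(self.queued.items()))
            if now - since < self.timeout:
                break  # queue is ordered by time, so the rest is younger
            self.queued.popitem(last=False)
            freed.append(page)
        return freed
```

The rescue path is a single dictionary removal, which mirrors the "cheap to
put it back" property, while only pages returned by reap() would end up on
the free list and become candidates for reporting.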

> Btw, have you ever looked at CMM2 [1]? With that, the host can 
> essentially just "steal" pages from the guest when it needs any, without 
> the need to execute the guest meanwhile. That means inside the host 
> swapping path, CMM2 can just evict guest page cache pages as easily as 
> we evict host page cache pages. To me, that's even more attractive in 
> the swap / emergency case than an interface which requires the guest to 
> proactively execute while we are in a low mem situation.

<snip>

> [1] https://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf

I hadn't read through this before. If nothing else the verbiage is useful
since what we are discussing is essentially how to deal with the
"volatile" pages within the system, the "unused" pages are the ones we
have reported to the host with the page reporting, and the "stable" pages
are those pages that have been faulted back into the guest when it
accessed them.

I can see there would be some advantages to CMM2, however it seems like it
is adding a significant amount of state to pages since it has to support a
fairly significant number of states and then there is the added complexity
for all the transitions in and out of stable from the various states
depending on how things are being changed.

Do you happen to know if anyone has done any research into how much
overhead is added with CMM2 enabled? I'd be curious since it seems like
the paper mentions having to track a significant number of state
transitions for the memory throughout the kernel.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 18:33       ` Alexander Duyck
@ 2020-01-23 18:47         ` Graf (AWS), Alexander
  2020-01-23 22:05           ` Alexander Duyck
  0 siblings, 1 reply; 39+ messages in thread
From: Graf (AWS), Alexander @ 2020-01-23 18:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, kvm, mst, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk, david,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, osalvador, Paterson-Jones, Roland,
	hannes, hare, Christian Borntraeger, Singh, Balbir



>> Am 23.01.2020 um 19:34 schrieb Alexander Duyck <alexander.h.duyck@linux.intel.com>:
>> 
>> On Thu, 2020-01-23 at 17:54 +0100, Alexander Graf wrote:
>>> On 23.01.20 17:26, Alexander Duyck wrote:
>>> On Thu, 2020-01-23 at 11:20 +0100, Alexander Graf wrote:
>>>> Hi Alex,
>>>>> On 22.01.20 18:43, Alexander Duyck wrote:
>> [...]
>>>>> The overall guest size is kept fairly small to only a few GB while the test
>>>>> is running. If the host memory were oversubscribed this patch set should
>>>>> result in a performance improvement as swapping memory in the host can be
>>>>> avoided.
>>>> I really like the approach overall. Voluntarily propagating free memory
>>>> from a guest to the host has been a sore point ever since KVM was
>>>> around. This solution looks like a very elegant way to do so.
>>>> The big piece I'm missing is the page cache. Linux will by default try
>>>> to keep the free list as small as it can in favor of page cache, so most
>>>> of the benefit of this patch set will be void in real world scenarios.
>>> Agreed. This is the next piece of this I plan to work on once this is
>>> accepted. For now the quick and dirty approach is to essentially make use
>>> of the /proc/sys/vm/drop_caches interface in the guest by either putting
>>> it in a cronjob somewhere or to have it after memory intensive workloads.
>>>> Traditionally, this was solved by creating pressure from the host
>>>> through virtio-balloon: Exactly the piece that this patch set gets away
>>>> with. I never liked "ballooning", because the host has very limited
>>>> visibility into the actual memory utility of its guests. So leaving the
>>>> decision on how much memory is actually needed at a given point in time
>>>> should ideally stay with the guest.
>>>> What would keep us from applying the page hinting approach to inactive,
>>>> clean page cache pages? With writeback in place as well, we would slowly
>>>> propagate pages from
>>>>   dirty -> clean -> clean, inactive -> free -> host owned
>>>> which gives a guest a natural path to give up "not important" memory.
>>> I considered something similar. Basically one thought I had was to
>>> essentially look at putting together some sort of epoch. When the host is
>>> under memory pressure it would need to somehow notify the guest and then
>>> the guest would start moving the epoch forward so that we start evicting
>>> pages out of the page cache when the host is under memory pressure.
>> I think we want to consider an interface in which the host actively asks
>> guests to purge pages to be on the same line as swapping: The last line
>> of defense.
> 
> I suppose. The only reason I was thinking that we may want to look at
> doing something like that was to avoid putting pressure on the guest when
> the host doesn't need us to.
> 
>> In the normal mode of operation, you still want to shrink down
>> voluntarily, so that everyone cooperatively tries to make room for new
>> guests you could potentially run on the same host.
>> If you start to apply pressure to guests to find out if they might have
>> some pages to spare, we're almost back to the old style ballooning approach.
> 
> That's true. In addition we avoid possible issues with us trying to flush
> out a bunch of memory from multiple guests at once, since they would be
> proactively freeing the memory.
> 
> I'm thinking the inactive state could be something similar to MADV_FREE in
> terms of behavior.  If it sits in the queue for long enough we decide
> nobody is using it anymore so it is freed, but if it is accessed it is
> cheap for us to just put it back without much in the way of overhead.

I think the main difference between MADV_FREE and what we want is that we also want to pull the page into the active state on read.

But sure, that's a possible interface. What I'd like to make sure of is that we can have different host policies: discard the page straight away, keep it for a fixed amount of time or discard it lazily on pressure. As long as the guest gives the host its clean pages voluntarily, I'm happy.
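
To illustrate what I mean by different host policies, here is a small sketch. Everything in it is hypothetical (the enum, the function name, and the keep_s default are invented for this example); it only shows that the three policies can sit behind one decision point on the host side.

```python
from enum import Enum


class HostPolicy(Enum):
    DISCARD_NOW = "discard the page straight away"
    KEEP_FIXED_TIME = "keep it for a fixed amount of time"
    DISCARD_ON_PRESSURE = "discard it lazily on pressure"


def should_discard(policy, age_s, under_pressure, keep_s=60.0):
    """Decide whether the host drops a page the guest handed over.

    age_s: seconds since the guest gave up the page.
    under_pressure: whether the host is currently short on memory.
    keep_s: retention time for KEEP_FIXED_TIME (assumed default).
    """
    if policy is HostPolicy.DISCARD_NOW:
        return True
    if policy is HostPolicy.KEEP_FIXED_TIME:
        return age_s >= keep_s
    if policy is HostPolicy.DISCARD_ON_PRESSURE:
        return under_pressure
    raise ValueError(f"unknown policy: {policy!r}")
```

The guest side stays identical in all three cases; only the host decides when the handed-over page actually goes away.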

Btw, have you already given thought to the faulting interface when a page was evicted? That's where it gets especially tricky. With a simple "discard the page straight away" style interface, we would not have to fault.

> 
>> Btw, have you ever looked at CMM2 [1]? With that, the host can
>> essentially just "steal" pages from the guest when it needs any, without
>> the need to execute the guest meanwhile. That means inside the host
>> swapping path, CMM2 can just evict guest page cache pages as easily as
>> we evict host page cache pages. To me, that's even more attractive in
>> the swap / emergency case than an interface which requires the guest to
>> proactively execute while we are in a low mem situation.
> 
> <snip>
> 
>> [1] https://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf
> 
> I hadn't read through this before. If nothing else the verbiage is useful
> since what we are discussing is essentially how to deal with the
> "volatile" pages within the system, the "unused" pages are the ones we
> have reported to the host with the page reporting, and the "stable" pages
> are those pages that have been faulted back into the guest when it
> accessed them.
> 
> I can see there would be some advantages to CMM2, however it seems like it
> is adding a significant amount of state to pages since it has to support a
> fairly significant number of states and then there is the added complexity
> for all the transitions in and out of stable from the various states
> depending on how things are being changed.
> 
> Do you happen to know if anyone has done any research into how much
> overhead is added with CMM2 enabled? I'd be curious since it seems like
> the paper mentions having to track a significant number of state
> transitions for the memory throughout the kernel.

Let me add Christian Borntraeger to the thread. He can definitely help on that side. I asked him earlier today and he confirmed that CMM2 is in active use on s390.

Alex




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 16:26   ` Alexander Duyck
  2020-01-23 16:54     ` Alexander Graf
  2020-01-23 17:20     ` Dave Hansen
@ 2020-01-23 19:17     ` Johannes Weiner
  2020-01-23 22:29       ` Alexander Duyck
  2 siblings, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2020-01-23 19:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka, yang.zhang.wz, nitesh,
	konrad.wilk, david, pagupta, riel, lcapitulino, dave.hansen,
	wei.w.wang, aarcange, pbonzini, dan.j.williams, osalvador,
	Paterson-Jones, Roland, hare

On Thu, Jan 23, 2020 at 08:26:39AM -0800, Alexander Duyck wrote:
> On Thu, 2020-01-23 at 11:20 +0100, Alexander Graf wrote:
> > Hi Alex,
> > 
> > On 22.01.20 18:43, Alexander Duyck wrote:
> > > This series provides an asynchronous means of reporting free guest pages
> > > to a hypervisor so that the memory associated with those pages can be
> > > dropped and reused by other processes and/or guests on the host. Using
> > > this it is possible to avoid unnecessary I/O to disk and greatly improve
> > > performance in the case of memory overcommit on the host.
> > > 
> > > When enabled we will be performing a scan of free memory every 2 seconds
> > > while pages of sufficiently high order are being freed. In each pass at
> > > least one sixteenth of each free list will be reported. By doing this we
> > > avoid racing against other threads that may be causing a high amount of
> > > memory churn.
> > > 
> > > The lowest page order currently scanned when reporting pages is
> > > pageblock_order so that this feature will not interfere with the use of
> > > Transparent Huge Pages in the case of virtualization.
> > > 
> > > Currently this is only in use by virtio-balloon however there is the hope
> > > that at some point in the future other hypervisors might be able to make
> > > use of it. In the virtio-balloon/QEMU implementation the hypervisor is
> > > currently using MADV_DONTNEED to indicate to the host kernel that the page
> > > is currently free. It will be zeroed and faulted back into the guest the
> > > next time the page is accessed.
> > > 
> > > To track if a page is reported or not the Uptodate flag was repurposed and
> > > used as a Reported flag for Buddy pages. We walk through the free list
> > > isolating pages and adding them to the scatterlist until we either
> > > encounter the end of the list, processed as many pages as were listed in
> > > nr_free prior to us starting, or have filled the scatterlist with pages to
> > > be reported. If we fill the scatterlist before we reach the end of the
> > > list we rotate the list so that the first unreported page we encounter is
> > > moved to the head of the list as that is where we will resume after we
> > > have freed the reported pages back into the tail of the list.
> > > 
> > > Below are the results from various benchmarks. I primarily focused on two
> > > tests. The first is the will-it-scale/page_fault2 test, and the other is
> > > a modified version of will-it-scale/page_fault1 that was enabled to use
> > > THP. I did this as it allows for better visibility into different parts
> > > of the memory subsystem. The guest is running with 32G for RAM on one
> > > node of a E5-2630 v3. The host has had some features such as CPU turbo
> > > disabled in the BIOS.
> > > 
> > > Test                   page_fault1 (THP)    page_fault2
> > > Name            tasks  Process Iter  STDEV  Process Iter  STDEV
> > > Baseline            1    1012402.50  0.14%     361855.25  0.81%
> > >                     16    8827457.25  0.09%    3282347.00  0.34%
> > > 
> > > Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
> > >                     16    8784741.75  0.39%    3240669.25  0.48%
> > > 
> > > Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
> > >                     16    8756219.00  0.24%    3226608.75  0.97%
> > > 
> > > Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
> > >   page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
> > > 
> > > Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
> > >   shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
> > > 
> > > The results above are for a baseline with a linux-next-20191219 kernel,
> > > that kernel with this patch set applied but page reporting disabled in
> > > virtio-balloon, the patches applied and page reporting fully enabled, the
> > > patches enabled with page shuffling enabled, and the patches applied with
> > > page shuffling enabled and an RFC patch that makes use of MADV_FREE in
> > > QEMU. These results include the deviation seen between the average value
> > > reported here versus the high and/or low value. I observed that during the
> > > test memory usage for the first three tests never dropped whereas with the
> > > patches fully enabled the VM would drop to using only a few GB of the
> > > host's memory when switching from memhog to page fault tests.
> > > 
> > > Any of the overhead visible with this patch set enabled seems due to page
> > > faults caused by accessing the reported pages and the host zeroing the page
> > > before giving it back to the guest. This overhead is much more visible when
> > > using THP than with standard 4K pages. In addition page shuffling seemed to
> > > increase the amount of faults generated due to an increase in memory churn.
> > > The overhead is reduced when using MADV_FREE as we can avoid the extra
> > > zeroing of the pages when they are reintroduced to the host, as can be seen
> > > when the RFC is applied with shuffling enabled.
> > > 
> > > The overall guest size is kept fairly small to only a few GB while the test
> > > is running. If the host memory were oversubscribed this patch set should
> > > result in a performance improvement as swapping memory in the host can be
> > > avoided.
> > 
> > I really like the approach overall. Voluntarily propagating free memory 
> > from a guest to the host has been a sore point ever since KVM was 
> > around. This solution looks like a very elegant way to do so.
> > 
> > The big piece I'm missing is the page cache. Linux will by default try 
> > to keep the free list as small as it can in favor of page cache, so most 
> > of the benefit of this patch set will be void in real world scenarios.
> 
> Agreed. This is the next piece of this I plan to work on once this is
> accepted. For now the quick and dirty approach is to essentially make use
> of the /proc/sys/vm/drop_caches interface in the guest by either putting
> it in a cronjob somewhere or to have it after memory intensive workloads.
> 
> > Traditionally, this was solved by creating pressure from the host 
> > through virtio-balloon: Exactly the piece that this patch set gets away 
> > with. I never liked "ballooning", because the host has very limited 
> > visibility into the actual memory utility of its guests. So leaving the 
> > decision on how much memory is actually needed at a given point in time 
> > should ideally stay with the guest.
> > 
> > What would keep us from applying the page hinting approach to inactive, 
> > clean page cache pages? With writeback in place as well, we would slowly 
> > propagate pages from
> > 
> >    dirty -> clean -> clean, inactive -> free -> host owned
> > 
> > which gives a guest a natural path to give up "not important" memory.
> 
> I considered something similar. Basically one thought I had was to
> essentially look at putting together some sort of epoch. When the host is
> under memory pressure it would need to somehow notify the guest and then
> the guest would start moving the epoch forward so that we start evicting
> pages out of the page cache when the host is under memory pressure.
> 
> > The big problem I see is that what I really want from a user's point of 
> > view is a tuneable that says "Automatically free clean page cache pages 
> > that were not accessed in the last X minutes". Otherwise we may run into 
> > the risk of evicting sometimes-in-use page cache pages.
> > 
> > I have a hard time grasping the mm code to understand how hard that 
> > would be to implement that though :).
> > 
> > 
> > Alex
> 
> Yeah, I am not exactly an expert on this either as I have only been
> working in the MM tree for about a year now.
> 
> I have submitted this as a topic for LSF/MM summit[1] and I am hoping to
> get some feedback on the best way to apply proactive memory pressure as
> one of the subtopics if it is selected.

I've been working on a proactive reclaim project that shrinks
workloads to their smallest, still healthy, memory footprint.

Because we (FB) have a similar problem with containers: in order to
know how many workloads can be safely combined on a host, we first
need to know how much memory a given workload truly requires - as
opposed to how many pages it would gobble up for one-off cache and
cold anon regions if it had the whole machine to itself.

This userspace tool uses cgroups and psi to adjust the memory limits
of workloads in a pressure feedback loop. It targets a minimal rate of
refaults/swapping/reclaim activity to identify the point where all the
cold pages have been evicted and we're *just* about to start eating
into warmer memory.
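
A minimal sketch of such a feedback loop, assuming cgroup v2's
memory.pressure (PSI) and memory.high files. This is not senpai's actual
algorithm: the function names, the target pressure, the step size and the
floor are all made-up values for illustration.

```python
def parse_psi_some_avg10(text: str) -> float:
    """Return the avg10 value from the 'some' line of a PSI pressure file."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some' line with avg10 found")


def next_limit(limit: int, pressure: float, target: float = 0.1,
               step: float = 0.01, floor: int = 64 << 20) -> int:
    """Shrink the limit while pressure is below target, back off otherwise.

    All defaults are hypothetical tuning values, not taken from senpai.
    """
    if pressure < target:
        limit = int(limit * (1.0 - step))      # probe: squeeze out cold pages
    else:
        limit = int(limit * (1.0 + 2 * step))  # retreat: we reached warm memory
    return max(limit, floor)                   # never shrink below the floor
```

In a real loop the pressure text would be read from the workload cgroup's
memory.pressure file on each interval and the result written back to its
memory.high file.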

With SSDs, control over pressure is fine-grained enough that we can
run it on even highly latency-sensitive things like our web servers
without impacting response time meaningfully.

It harnesses the VM's existing LRU/clock algorithm to identify the
pages which are most likely to be cold, so the approach scales to
large memory sizes (256G+) with only minor CPU overhead.

https://github.com/facebookincubator/senpai

The same concept could be applicable to shrinking guests proactively
in virtualized environments?


* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 17:20     ` Dave Hansen
@ 2020-01-23 19:23       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 39+ messages in thread
From: Konrad Rzeszutek Wilk @ 2020-01-23 19:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexander Duyck, Alexander Graf, Alexander Duyck, kvm, mst,
	linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka,
	Van De Ven, Arjan, yang.zhang.wz, nitesh, david, pagupta, riel,
	lcapitulino, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	osalvador, Paterson-Jones, Roland, hannes, hare, Boeuf,
	Sebastien

On Thu, Jan 23, 2020 at 09:20:15AM -0800, Dave Hansen wrote:
> On 1/23/20 8:26 AM, Alexander Duyck wrote:
> >> The big piece I'm missing is the page cache. Linux will by default try 
> >> to keep the free list as small as it can in favor of page cache, so most 
> >> of the benefit of this patch set will be void in real world scenarios.
> > Agreed. This is the next piece of this I plan to work on once this is
> > accepted. For now the quick and dirty approach is to essentially make use
> > of the /proc/sys/vm/drop_caches interface in the guest by either putting
> > it in a cronjob somewhere or to have it after memory intensive workloads.
> 
> There was an implementation in "Clear Linux" that used this sysctl:
> 
> > https://github.com/Conan-Kudo/omv-kernel-rc/blob/master/0154-sysctl-vm-Fine-grained-cache-shrinking.patch
> 
> (I can't find it in the Clear repos at the moment, must not be used
> currently).  But the idea was to have a little daemon in the host that
> periodically applied some artificial pressure with this sysctl.  This
> sysctl is a smaller hammer than /proc/sys/vm/drop_caches and lets you
> drop small amounts of cache.
> 
> The right way to do it is probably to do real, generic reclaim instead
> of drop_caches.

This sounds like Transcendent Memory (https://www.linux-kvm.org/images/d/d7/TmemNotVirt-Linuxcon2011-Final.pdf),
which has a driver in the Linux kernel that pushes on the swapper and
everything else to evict pages to the hypervisor.

Look at cleancache, frontswap, and xen-selfballoon.c (which was removed by
commit 814bbf49dcd0ad642e7ceb8991e57555c5472cce).

Avi Kivity pointed out one big issue with all of this - customers have
to be nicely behaved - which they don't seem to be.

But I would recommend you look at cleancache for the page cache.

> 
> This isn't conceptually *that* far away from the "proactive reclaim"
> that other folks have proposed:
> 
> 	https://lwn.net/Articles/787611/


* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 18:47         ` Graf (AWS), Alexander
@ 2020-01-23 22:05           ` Alexander Duyck
  0 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-01-23 22:05 UTC (permalink / raw)
  To: Graf (AWS), Alexander
  Cc: Alexander Duyck, kvm, mst, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk, david,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, osalvador, Paterson-Jones, Roland,
	hannes, hare, Christian Borntraeger, Singh, Balbir

On Thu, 2020-01-23 at 18:47 +0000, Graf (AWS), Alexander wrote:
> > > Am 23.01.2020 um 19:34 schrieb Alexander Duyck <alexander.h.duyck@linux.intel.com>:
> > > 
> > > On Thu, 2020-01-23 at 17:54 +0100, Alexander Graf wrote:
> > > > On 23.01.20 17:26, Alexander Duyck wrote:
> > > > On Thu, 2020-01-23 at 11:20 +0100, Alexander Graf wrote:
> > > > > Hi Alex,
> > > > > > On 22.01.20 18:43, Alexander Duyck wrote:
> > > [...]
> > > > > > The overall guest size is kept fairly small to only a few GB while the test
> > > > > > is running. If the host memory were oversubscribed this patch set should
> > > > > > result in a performance improvement as swapping memory in the host can be
> > > > > > avoided.
> > > > > I really like the approach overall. Voluntarily propagating free memory
> > > > > from a guest to the host has been a sore point ever since KVM was
> > > > > around. This solution looks like a very elegant way to do so.
> > > > > The big piece I'm missing is the page cache. Linux will by default try
> > > > > to keep the free list as small as it can in favor of page cache, so most
> > > > > of the benefit of this patch set will be void in real world scenarios.
> > > > > Agreed. This is the next piece of this I plan to work on once this is
> > > > accepted. For now the quick and dirty approach is to essentially make use
> > > > of the /proc/sys/vm/drop_caches interface in the guest by either putting
> > > > it in a cronjob somewhere or to have it after memory intensive workloads.
> > > > > Traditionally, this was solved by creating pressure from the host
> > > > > through virtio-balloon: Exactly the piece that this patch set gets away
> > > > > with. I never liked "ballooning", because the host has very limited
> > > > > visibility into the actual memory utility of its guests. So leaving the
> > > > > decision on how much memory is actually needed at a given point in time
> > > > > should ideally stay with the guest.
> > > > > What would keep us from applying the page hinting approach to inactive,
> > > > > clean page cache pages? With writeback in place as well, we would slowly
> > > > > propagate pages from
> > > > >   dirty -> clean -> clean, inactive -> free -> host owned
> > > > > which gives a guest a natural path to give up "not important" memory.
> > > > I considered something similar. Basically one thought I had was to
> > > > essentially look at putting together some sort of epoch. When the host is
> > > > under memory pressure it would need to somehow notify the guest and then
> > > > the guest would start moving the epoch forward so that we start evicting
> > > > pages out of the page cache when the host is under memory pressure.
> > > I think we want to consider an interface in which the host actively asks
> > > guests to purge pages to be on the same line as swapping: The last line
> > > of defense.
> > 
> > I suppose. The only reason I was thinking that we may want to look at
> > doing something like that was to avoid putting pressure on the guest when
> > the host doesn't need us to.
> > 
> > > In the normal mode of operation, you still want to shrink down
> > > voluntarily, so that everyone cooperatively tries to make free for new
> > > guests you could potentially run on the same host.
> > > If you start to apply pressure to guests to find out if they might have
> > > some pages to spare, we're almost back to the old style ballooning approach.
> > 
> > That's true. In addition we avoid possible issues with us trying to flush
> > out a bunch of memory from multiple guests at once since they would be
> > proactively freeing the memory.
> > 
> > I'm thinking the inactive state could be something similar to MADV_FREE in
> > terms of behavior.  If it sits in the queue for long enough we decide
> > nobody is using it anymore so it is freed, but if it is accessed it is
> > cheap for us to just put it back without much in the way of overhead.
> 
> I think the main difference between the MADV_FREE and what we want is
> that we also want to pull the page into active state on read.
> 
> But sure, that's a possible interface. What I'd like to make sure of is
> that we can have different host policies: discard the page straight
> away, keep it for a fixed amount of time or discard it lazily on
> pressure. As long as the guest gives the host its clean pages
> voluntarily, I'm happy.

Well the current model I am working with has us using MADV_DONTNEED from
the hypervisor if the unused page is reported. So it will still have to be
pulled back in, but it will start out as a zeroed page.
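
To illustrate why the page comes back zeroed, here is a small userspace
sketch, assuming Linux and Python 3.8+ (for mmap.madvise). It is not the
QEMU code path; the function name is made up for illustration, and it only
exercises a private anonymous mapping.

```python
import mmap
import sys


def dontneed_reads_back_zero(size: int = mmap.PAGESIZE):
    """Return True if a discarded private anonymous page reads back as zeroes.

    Returns None on platforms where the Linux MADV_DONTNEED semantics
    cannot be exercised from Python.
    """
    if not sys.platform.startswith("linux") or not hasattr(mmap, "MADV_DONTNEED"):
        return None
    # fileno -1 requests anonymous memory; MAP_PRIVATE keeps it private.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"\xaa" * size          # touch the page so it is backed by memory
        buf.madvise(mmap.MADV_DONTNEED)  # tell the kernel the contents can be dropped
        return buf[:] == b"\x00" * size  # the next access faults in a zero page
    finally:
        buf.close()
```

The zero-fill on the next access is the cost discussed in the cover
letter: the data is gone and a fresh page has to be faulted in.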

> Btw, have you already given thought to the faulting interface when a
> page was evicted? That's where it gets especially tricky. With a simple
> "discard the page straight away" style interface, we would not have to
> fault.

So the fault I was referring to would be inside the guest only. Basically
we would keep the page for a little while longer while it is inactive and
just let the mapping go. Then if something accesses it before we finally
release it we don't pay the heavy cost of having to get it back from the
host and then copying the memory back in from swap or the file.
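
A toy model of that grace period, purely illustrative: the class and
method names are hypothetical, and it tracks abstract page ids and integer
ticks rather than any real kernel state. It assumes the tick values passed
in never go backwards.

```python
from collections import OrderedDict


class InactiveQueue:
    """Pages wait `grace` ticks in the queue before being released."""

    def __init__(self, grace: int):
        self.grace = grace
        self.pending = OrderedDict()  # page id -> tick at which it went inactive
        self.released = []            # pages given up after the grace period

    def deactivate(self, page: int, now: int) -> None:
        """Unmap a page and queue it; re-queuing moves it to the tail."""
        self.pending.pop(page, None)
        self.pending[page] = now

    def access(self, page: int) -> bool:
        """Return True if the page was still queued and is rescued cheaply."""
        return self.pending.pop(page, None) is not None

    def expire(self, now: int) -> None:
        """Release every page that has been inactive for at least `grace` ticks."""
        while self.pending:
            page, since = next(iter(self.pending.items()))
            if now - since < self.grace:
                break                 # the oldest page is still within its grace period
            del self.pending[page]
            self.released.append(page)
```

A page accessed while still in the queue only needs to be mapped again; a
page that has already been released has to come back from the host, swap
or the backing file.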

I'm just loosely basing that on the "proactive reclaim" idea that was
proposed back at the last LSF/MM summit (https://lwn.net/Articles/787611/).
I still haven't even started work on any of those pieces yet nor looked
at them too closely. I'm still in the information gathering phase.


> > > Btw, have you ever looked at CMM2 [1]? With that, the host can
> > > essentially just "steal" pages from the guest when it needs any, without
> > > the need to execute the guest meanwhile. That means inside the host
> > > swapping path, CMM2 can just evict guest page cache pages as easily as
> > > we evict host page cache pages. To me, that's even more attractive in
> > > the swap / emergency case than an interface which requires the guest to
> > > proactively execute while we are in a low mem situation.
> > 
> > <snip>
> > 
> > > [1] https://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf
> > 
> > I hadn't read through this before. If nothing else the verbiage is useful
> > since what we are discussing is essentially how to deal with the
> > "volatile" pages within the system, the "unused" pages are the ones we
> > have reported to the host with the page reporting, and the "stable" pages
> > are those pages that have been faulted back into the guest when it
> > accessed them.
> > 
> > I can see there would be some advantages to CMM2, however it seems like it
> > is adding a significant amount of state to pages since it has to support a
> > fairly significant number of states and then there is the added complexity
> > for all the transitions in and out of stable from the various states
> > depending on how things are being changed.
> > 
> > Do you happen to know if anyone has done any research into how much
> > overhead is added with CMM2 enabled? I'd be curious since it seems like
> > the paper mentions having to track a significant number of state
> > transitions for the memory throughout the kernel.
> 
> Let me add Christian Borntraeger to the thread. He can definitely help
> on that side. I asked him earlier today and he confirmed that cmm2 is in
> active use on s390.
> 
> Alex

Okay, sounds good.

- Alex



* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 19:17     ` Johannes Weiner
@ 2020-01-23 22:29       ` Alexander Duyck
  2020-01-23 23:24         ` Dave Hansen
  0 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2020-01-23 22:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka, yang.zhang.wz, nitesh,
	konrad.wilk, david, pagupta, riel, lcapitulino, dave.hansen,
	wei.w.wang, aarcange, pbonzini, dan.j.williams, osalvador,
	Paterson-Jones, Roland, hare

On Thu, 2020-01-23 at 14:17 -0500, Johannes Weiner wrote:
> On Thu, Jan 23, 2020 at 08:26:39AM -0800, Alexander Duyck wrote:
> > On Thu, 2020-01-23 at 11:20 +0100, Alexander Graf wrote:
> > > Hi Alex,
> > > 
> > > On 22.01.20 18:43, Alexander Duyck wrote:
> > > > This series provides an asynchronous means of reporting free guest pages
> > > > to a hypervisor so that the memory associated with those pages can be
> > > > dropped and reused by other processes and/or guests on the host. Using
> > > > this it is possible to avoid unnecessary I/O to disk and greatly improve
> > > > performance in the case of memory overcommit on the host.
> > > > 
> > > > When enabled we will be performing a scan of free memory every 2 seconds
> > > > while pages of sufficiently high order are being freed. In each pass at
> > > > least one sixteenth of each free list will be reported. By doing this we
> > > > avoid racing against other threads that may be causing a high amount of
> > > > memory churn.
> > > > 
> > > > The lowest page order currently scanned when reporting pages is
> > > > pageblock_order so that this feature will not interfere with the use of
> > > > Transparent Huge Pages in the case of virtualization.
> > > > 
> > > > Currently this is only in use by virtio-balloon however there is the hope
> > > > that at some point in the future other hypervisors might be able to make
> > > > use of it. In the virtio-balloon/QEMU implementation the hypervisor is
> > > > currently using MADV_DONTNEED to indicate to the host kernel that the page
> > > > is currently free. It will be zeroed and faulted back into the guest the
> > > > next time the page is accessed.
> > > > 
> > > > To track if a page is reported or not the Uptodate flag was repurposed and
> > > > used as a Reported flag for Buddy pages. We walk through the free list
> > > > isolating pages and adding them to the scatterlist until we either
> > > > encounter the end of the list, processed as many pages as were listed in
> > > > nr_free prior to us starting, or have filled the scatterlist with pages to
> > > > be reported. If we fill the scatterlist before we reach the end of the
> > > > list we rotate the list so that the first unreported page we encounter is
> > > > moved to the head of the list as that is where we will resume after we
> > > > have freed the reported pages back into the tail of the list.
> > > > 
> > > > Below are the results from various benchmarks. I primarily focused on two
> > > > tests. The first is the will-it-scale/page_fault2 test, and the other is
> > > > a modified version of will-it-scale/page_fault1 that was enabled to use
> > > > THP. I did this as it allows for better visibility into different parts
> > > > of the memory subsystem. The guest is running with 32G for RAM on one
> > > > node of a E5-2630 v3. The host has had some features such as CPU turbo
> > > > disabled in the BIOS.
> > > > 
> > > > Test                   page_fault1 (THP)    page_fault2
> > > > Name            tasks  Process Iter  STDEV  Process Iter  STDEV
> > > > Baseline            1    1012402.50  0.14%     361855.25  0.81%
> > > >                     16    8827457.25  0.09%    3282347.00  0.34%
> > > > 
> > > > Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
> > > >                     16    8784741.75  0.39%    3240669.25  0.48%
> > > > 
> > > > Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
> > > >                     16    8756219.00  0.24%    3226608.75  0.97%
> > > > 
> > > > Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
> > > >   page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
> > > > 
> > > > Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
> > > >   shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
> > > > 
> > > > The results above are for a baseline with a linux-next-20191219 kernel,
> > > > that kernel with this patch set applied but page reporting disabled in
> > > > virtio-balloon, the patches applied and page reporting fully enabled, the
> > > > patches enabled with page shuffling enabled, and the patches applied with
> > > > page shuffling enabled and an RFC patch that makes use of MADV_FREE in
> > > > QEMU. These results include the deviation seen between the average value
> > > > reported here versus the high and/or low value. I observed that during the
> > > > test memory usage for the first three tests never dropped whereas with the
> > > > patches fully enabled the VM would drop to using only a few GB of the
> > > > host's memory when switching from memhog to page fault tests.
> > > > 
> > > > Any of the overhead visible with this patch set enabled seems due to page
> > > > faults caused by accessing the reported pages and the host zeroing the page
> > > > before giving it back to the guest. This overhead is much more visible when
> > > > using THP than with standard 4K pages. In addition page shuffling seemed to
> > > > increase the amount of faults generated due to an increase in memory churn.
> > > > The overhead is reduced when using MADV_FREE as we can avoid the extra
> > > > zeroing of the pages when they are reintroduced to the host, as can be seen
> > > > when the RFC is applied with shuffling enabled.
> > > > 
> > > > The overall guest size is kept fairly small to only a few GB while the test
> > > > is running. If the host memory were oversubscribed this patch set should
> > > > result in a performance improvement as swapping memory in the host can be
> > > > avoided.
> > > 
> > > I really like the approach overall. Voluntarily propagating free memory 
> > > from a guest to the host has been a sore point ever since KVM was 
> > > around. This solution looks like a very elegant way to do so.
> > > 
> > > The big piece I'm missing is the page cache. Linux will by default try 
> > > to keep the free list as small as it can in favor of page cache, so most 
> > > of the benefit of this patch set will be void in real world scenarios.
> > 
> > Agreed. This is the next piece of this I plan to work on once this is
> > accepted. For now the quick and dirty approach is to essentially make use
> > of the /proc/sys/vm/drop_caches interface in the guest by either putting
> > it in a cronjob somewhere or to have it after memory intensive workloads.
> > 
> > > Traditionally, this was solved by creating pressure from the host 
> > > through virtio-balloon: Exactly the piece that this patch set gets away 
> > > with. I never liked "ballooning", because the host has very limited 
> > > visibility into the actual memory utility of its guests. So leaving the 
> > > decision on how much memory is actually needed at a given point in time 
> > > should ideally stay with the guest.
> > > 
> > > What would keep us from applying the page hinting approach to inactive, 
> > > clean page cache pages? With writeback in place as well, we would slowly 
> > > propagate pages from
> > > 
> > >    dirty -> clean -> clean, inactive -> free -> host owned
> > > 
> > > which gives a guest a natural path to give up "not important" memory.
> > 
> > I considered something similar. Basically one thought I had was to
> > essentially look at putting together some sort of epoch. When the host is
> > under memory pressure it would need to somehow notify the guest and then
> > the guest would start moving the epoch forward so that we start evicting
> > pages out of the page cache when the host is under memory pressure.
> > 
> > > The big problem I see is that what I really want from a user's point of 
> > > view is a tuneable that says "Automatically free clean page cache pages 
> > > that were not accessed in the last X minutes". Otherwise we may run into 
> > > the risk of evicting sometimes-in-use page cache pages.
> > > 
> > > I have a hard time grasping the mm code to understand how hard that 
> > > would be to implement that though :).
> > > 
> > > 
> > > Alex
> > 
> > Yeah, I am not exactly an expert on this either as I have only been
> > working in the MM tree for about a year now.
> > 
> > I have submitted this as a topic for LSF/MM summit[1] and I am hoping to
> > get some feedback on the best way to apply proactive memory pressure as
> > one of the subtopics if it is selected.
> 
> I've been working on a proactive reclaim project that shrinks
> workloads to their smallest, still healthy, memory footprint.
> 
> Because we (FB) have a similar problem with containers: in order to
> know how many workloads can be safely combined on a host, we first
> need to know how much memory a given workload truly requires - as
> opposed to how many pages it would gobble up for one-off cache and
> cold anon regions if it had the whole machine to itself.
> 
> This userspace tool uses cgroups and psi to adjust the memory limits
> of workloads in a pressure feedback loop. It targets a minimal rate of
> refaults/swapping/reclaim activity to identify the point where all the
> cold pages have been evicted and we're *just* about to start eating
> into warmer memory.
> 
> With SSDs, control over pressure is fine-grained enough that we can
> run it on even highly latency-sensitive things like our web servers
> without impacting response time meaningfully.
> 
> It harnesses the VM's existing LRU/clock algorithm to identify the
> pages which are most likely to be cold, so the approach scales to
> large memory sizes (256G+) with only minor CPU overhead.
> 
> https://github.com/facebookincubator/senpai
> 
> The same concept could be applicable to shrinking guests proactively
> in virtualized environments?

Looking it over, this kind of does what we would want to do; however, we
would need to find a way to have this work without the cgroup requirement.
Essentially we would have the guest running this and then proactively
keeping its own resources in check.

- Alex




* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 22:29       ` Alexander Duyck
@ 2020-01-23 23:24         ` Dave Hansen
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Hansen @ 2020-01-23 23:24 UTC (permalink / raw)
  To: Alexander Duyck, Johannes Weiner
  Cc: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka, yang.zhang.wz, nitesh,
	konrad.wilk, david, pagupta, riel, lcapitulino, wei.w.wang,
	aarcange, pbonzini, dan.j.williams, osalvador, Paterson-Jones,
	Roland, hare

On 1/23/20 2:29 PM, Alexander Duyck wrote:
> Looking it over this kind of does what we would want to do, however we
> would need to find a way to have this work without the cgroup requirement.
> Essentially we would have the guest running this and then proactively
> keeping its own resources in check.

It's also worth noting that for Clear Linux, the guests are doing
container-like things (https://katacontainers.io/) but inside virtual
machines.  The VM content in this case is known and relatively trusted,
so it generally isn't a stretch to assume that it can run a daemon and will
mostly play nice.


* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-23 14:52     ` Alexander Graf
@ 2020-01-24 13:25       ` David Hildenbrand
  2020-01-24 16:20         ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-01-24 13:25 UTC (permalink / raw)
  To: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck, osalvador, Paterson-Jones, Roland, hannes,
	hare

On 23.01.20 15:52, Alexander Graf wrote:
> 
> 
> On 23.01.20 15:05, David Hildenbrand wrote:
>> On 23.01.20 11:20, Alexander Graf wrote:
>>> Hi Alex,
>>>
>>> On 22.01.20 18:43, Alexander Duyck wrote:
>>>> This series provides an asynchronous means of reporting free guest pages
>>>> to a hypervisor so that the memory associated with those pages can be
>>>> dropped and reused by other processes and/or guests on the host. Using
>>>> this it is possible to avoid unnecessary I/O to disk and greatly improve
>>>> performance in the case of memory overcommit on the host.
>>>>
>>>> When enabled we will be performing a scan of free memory every 2 seconds
>>>> while pages of sufficiently high order are being freed. In each pass at
>>>> least one sixteenth of each free list will be reported. By doing this we
>>>> avoid racing against other threads that may be causing a high amount of
>>>> memory churn.
>>>>
>>>> The lowest page order currently scanned when reporting pages is
>>>> pageblock_order so that this feature will not interfere with the use of
>>>> Transparent Huge Pages in the case of virtualization.
>>>>
>>>> Currently this is only in use by virtio-balloon however there is the hope
>>>> that at some point in the future other hypervisors might be able to make
>>>> use of it. In the virtio-balloon/QEMU implementation the hypervisor is
>>>> currently using MADV_DONTNEED to indicate to the host kernel that the page
>>>> is currently free. It will be zeroed and faulted back into the guest the
>>>> next time the page is accessed.
>>>>
>>>> To track if a page is reported or not the Uptodate flag was repurposed and
>>>> used as a Reported flag for Buddy pages. We walk through the free list
>>>> isolating pages and adding them to the scatterlist until we either
>>>> encounter the end of the list, processed as many pages as were listed in
>>>> nr_free prior to us starting, or have filled the scatterlist with pages to
>>>> be reported. If we fill the scatterlist before we reach the end of the
>>>> list we rotate the list so that the first unreported page we encounter is
>>>> moved to the head of the list as that is where we will resume after we
>>>> have freed the reported pages back into the tail of the list.
>>>>
>>>> Below are the results from various benchmarks. I primarily focused on two
>>>> tests. The first is the will-it-scale/page_fault2 test, and the other is
>>>> a modified version of will-it-scale/page_fault1 that was enabled to use
>>>> THP. I did this as it allows for better visibility into different parts
>>>> of the memory subsystem. The guest is running with 32G for RAM on one
>>>> node of a E5-2630 v3. The host has had some features such as CPU turbo
>>>> disabled in the BIOS.
>>>>
>>>> Test                   page_fault1 (THP)    page_fault2
>>>> Name            tasks  Process Iter  STDEV  Process Iter  STDEV
>>>> Baseline            1    1012402.50  0.14%     361855.25  0.81%
>>>>                      16    8827457.25  0.09%    3282347.00  0.34%
>>>>
>>>> Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
>>>>                      16    8784741.75  0.39%    3240669.25  0.48%
>>>>
>>>> Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
>>>>                      16    8756219.00  0.24%    3226608.75  0.97%
>>>>
>>>> Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
>>>>    page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
>>>>
>>>> Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
>>>>    shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
>>>>
>>>> The results above are for a baseline with a linux-next-20191219 kernel,
>>>> that kernel with this patch set applied but page reporting disabled in
>>>> virtio-balloon, the patches applied and page reporting fully enabled, the
>>>> patches enabled with page shuffling enabled, and the patches applied with
>>>> page shuffling enabled and an RFC patch that makes use of MADV_FREE in
>>>> QEMU. These results include the deviation seen between the average value
>>>> reported here versus the high and/or low value. I observed that during the
>>>> test memory usage for the first three tests never dropped whereas with the
>>>> patches fully enabled the VM would drop to using only a few GB of the
>>>> host's memory when switching from memhog to page fault tests.
>>>>
>>>> Any of the overhead visible with this patch set enabled seems due to page
>>>> faults caused by accessing the reported pages and the host zeroing the page
>>>> before giving it back to the guest. This overhead is much more visible when
>>>> using THP than with standard 4K pages. In addition page shuffling seemed to
>>>> increase the amount of faults generated due to an increase in memory churn.
>>>> The overhead is reduced when using MADV_FREE as we can avoid the extra
>>>> zeroing of the pages when they are reintroduced to the host, as can be seen
>>>> when the RFC is applied with shuffling enabled.
>>>>
>>>> The overall guest size is kept fairly small to only a few GB while the test
>>>> is running. If the host memory were oversubscribed this patch set should
>>>> result in a performance improvement as swapping memory in the host can be
>>>> avoided.
>>>
>>>
>>> I really like the approach overall. Voluntarily propagating free memory
>>> from a guest to the host has been a sore point ever since KVM was
>>> around. This solution looks like a very elegant way to do so.
>>>
>>> The big piece I'm missing is the page cache. Linux will by default try
>>> to keep the free list as small as it can in favor of page cache, so most
>>> of the benefit of this patch set will be void in real world scenarios.
>>
>> One approach is to move (parts of) the page cache from the guest to the
>> hypervisor - e.g., using emulated NVDIMM or virtio-pmem.
> 
> Whether you can do that depends heavily on your virtualization 
> environment. On a host with single tenant VMs, that's definitely 
> feasible. In a Kubernetes environment, it might also be feasible.

I would be interested to know in which environments this is an actual
problem that can't be solved in the hypervisor (e.g., see below).

> 
> But when you have VMs that assume that the host is interfering with them 
> as little as possible, it becomes harder:

... then you don't want free page reporting or even ballooning I suppose :)

> 
> How do you ensure fairness across different VMs' page cache that is 
> munged into a single big host one?

Interesting question. I would assume that this problem (e.g.,
partitioning the page cache/priorities/etc..) either

a) Already has been solved in the kernel for different processes/process
groups/containers. So for VMs as well.
b) Still is an open problem to solve for all these units.

But: Not a cgroup/pagecache expert

> 
> Do you even have host page cache or are you using SR-IOV / mdev for 
> storage for performance reasons?

E.g., with virtio-mem, a backing file will be mmapped into your VM
just like an NVDIMM. So there will be host page cache. But I am,
unfortunately, not an expert on that matter either (and I am not sure
whether this answers your question) :)

> 
> 
> The puzzle is still incomplete, even with NVDIMM exposure to the guest 
> as an option unfortunately :).

I think it is worth noting that free page reporting won't apply to all
environments either way. There is a steady overhead to do the reporting
and reporting only happens on >= 2MB chunks. There are setups where
virtio-balloon might still be desirable.

(e.g., fragmented guest, no runtime overhead, ...)
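
To make the ">= 2MB" figure concrete: the cover letter states that the
lowest order scanned is pageblock_order. Assuming 4KB base pages and a
pageblock_order of 9 (typical for x86-64, but an assumption here), the
smallest reported chunk works out to 2MB. The helper names below are
hypothetical and only illustrate the arithmetic; they are not taken from
the patch set.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest chunk that free page reporting would hand to the host, given
 * the base page size and the minimum order that is scanned. The cover
 * letter states that this minimum order is pageblock_order.
 */
size_t min_report_bytes(size_t page_size, unsigned int min_order)
{
	/* Guard against shifting past the width of size_t. */
	assert(min_order < sizeof(size_t) * 8);
	return page_size << min_order;
}

/* Would a free buddy block of the given order be considered at all? */
int order_is_reportable(unsigned int order, unsigned int min_order)
{
	return order >= min_order;
}
```

With these assumptions min_report_bytes(4096, 9) yields 2097152 bytes, and
a fragmented guest whose free blocks are all below order 9 would have
nothing to report.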

All this fiddling with the guest page cache feels wrong to me ... but
there seem to be interesting approaches being discussed.

So I do agree that the puzzle is still incomplete for some use cases /
environments.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-24 13:25       ` David Hildenbrand
@ 2020-01-24 16:20         ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-01-24 16:20 UTC (permalink / raw)
  To: Alexander Graf, Alexander Duyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm, akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck, osalvador, Paterson-Jones, Roland, hannes,
	hare

On 24.01.20 14:25, David Hildenbrand wrote:
> On 23.01.20 15:52, Alexander Graf wrote:
>>
>>
>> On 23.01.20 15:05, David Hildenbrand wrote:
>>> On 23.01.20 11:20, Alexander Graf wrote:
>>>> Hi Alex,
>>>>
>>>> On 22.01.20 18:43, Alexander Duyck wrote:
>>>>> This series provides an asynchronous means of reporting free guest pages
>>>>> to a hypervisor so that the memory associated with those pages can be
>>>>> dropped and reused by other processes and/or guests on the host. Using
>>>>> this it is possible to avoid unnecessary I/O to disk and greatly improve
>>>>> performance in the case of memory overcommit on the host.
>>>>>
>>>>> When enabled we will be performing a scan of free memory every 2 seconds
>>>>> while pages of sufficiently high order are being freed. In each pass at
>>>>> least one sixteenth of each free list will be reported. By doing this we
>>>>> avoid racing against other threads that may be causing a high amount of
>>>>> memory churn.
>>>>>
>>>>> The lowest page order currently scanned when reporting pages is
>>>>> pageblock_order so that this feature will not interfere with the use of
>>>>> Transparent Huge Pages in the case of virtualization.
>>>>>
>>>>> Currently this is only in use by virtio-balloon however there is the hope
>>>>> that at some point in the future other hypervisors might be able to make
>>>>> use of it. In the virtio-balloon/QEMU implementation the hypervisor is
>>>>> currently using MADV_DONTNEED to indicate to the host kernel that the page
>>>>> is currently free. It will be zeroed and faulted back into the guest the
>>>>> next time the page is accessed.
>>>>>
>>>>> To track if a page is reported or not the Uptodate flag was repurposed and
>>>>> used as a Reported flag for Buddy pages. We walk through the free list
>>>>> isolating pages and adding them to the scatterlist until we either
>>>>> encounter the end of the list, processed as many pages as were listed in
>>>>> nr_free prior to us starting, or have filled the scatterlist with pages to
>>>>> be reported. If we fill the scatterlist before we reach the end of the
>>>>> list we rotate the list so that the first unreported page we encounter is
>>>>> moved to the head of the list as that is where we will resume after we
>>>>> have freed the reported pages back into the tail of the list.
>>>>>
>>>>> Below are the results from various benchmarks. I primarily focused on two
>>>>> tests. The first is the will-it-scale/page_fault2 test, and the other is
>>>>> a modified version of will-it-scale/page_fault1 that was enabled to use
>>>>> THP. I did this as it allows for better visibility into different parts
>>>>> of the memory subsystem. The guest is running with 32G for RAM on one
>>>>> node of a E5-2630 v3. The host has had some features such as CPU turbo
>>>>> disabled in the BIOS.
>>>>>
>>>>> Test                   page_fault1 (THP)    page_fault2
>>>>> Name            tasks  Process Iter  STDEV  Process Iter  STDEV
>>>>> Baseline            1    1012402.50  0.14%     361855.25  0.81%
>>>>>                      16    8827457.25  0.09%    3282347.00  0.34%
>>>>>
>>>>> Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
>>>>>                      16    8784741.75  0.39%    3240669.25  0.48%
>>>>>
>>>>> Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
>>>>>                      16    8756219.00  0.24%    3226608.75  0.97%
>>>>>
>>>>> Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
>>>>>    page shuffle      16    8672601.25  0.49%    3223177.75  0.40%
>>>>>
>>>>> Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
>>>>>    shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%
>>>>>
>>>>> The results above are for a baseline with a linux-next-20191219 kernel,
>>>>> that kernel with this patch set applied but page reporting disabled in
>>>>> virtio-balloon, the patches applied and page reporting fully enabled, the
>>>>> patches enabled with page shuffling enabled, and the patches applied with
>>>>> page shuffling enabled and an RFC patch that makes use of MADV_FREE in
>>>>> QEMU. These results include the deviation seen between the average value
>>>>> reported here versus the high and/or low value. I observed that during the
>>>>> test memory usage for the first three tests never dropped whereas with the
>>>>> patches fully enabled the VM would drop to using only a few GB of the
>>>>> host's memory when switching from memhog to page fault tests.
>>>>>
>>>>> Any of the overhead visible with this patch set enabled seems due to page
>>>>> faults caused by accessing the reported pages and the host zeroing the page
>>>>> before giving it back to the guest. This overhead is much more visible when
>>>>> using THP than with standard 4K pages. In addition page shuffling seemed to
>>>>> increase the amount of faults generated due to an increase in memory churn.
>>>>> The overhead is reduced when using MADV_FREE as we can avoid the extra
>>>>> zeroing of the pages when they are reintroduced to the host, as can be seen
>>>>> when the RFC is applied with shuffling enabled.
>>>>>
>>>>> The overall guest size is kept fairly small to only a few GB while the test
>>>>> is running. If the host memory were oversubscribed this patch set should
>>>>> result in a performance improvement as swapping memory in the host can be
>>>>> avoided.
>>>>
>>>>
>>>> I really like the approach overall. Voluntarily propagating free memory
>>>> from a guest to the host has been a sore point ever since KVM was
>>>> around. This solution looks like a very elegant way to do so.
>>>>
>>>> The big piece I'm missing is the page cache. Linux will by default try
>>>> to keep the free list as small as it can in favor of page cache, so most
>>>> of the benefit of this patch set will be void in real world scenarios.
>>>
>>> One approach is to move (parts of) the page cache from the guest to the
>>> hypervisor - e.g., using emulated NVDIMM or virtio-pmem.
>>
>> Whether you can do that depends heavily on your virtualization 
>> environment. On a host with single tenant VMs, that's definitely 
>> feasible. In a Kubernetes environment, it might also be feasible.
> 
> I would be interested to know in which environments this is an actual
> problem that can't be solved in the hypervisor (e.g., see below).

Okay, as Alex told me offline, the (somewhat obvious) environments are
those where the hypervisor page cache is not involved (e.g., vfio etc.)

-- 
Thanks,

David / dhildenb



* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
       [not found] ` <20200124132352.12824-1-hdanton@sina.com>
@ 2020-01-24 16:40   ` Alexander Graf
  0 siblings, 0 replies; 39+ messages in thread
From: Alexander Graf @ 2020-01-24 16:40 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Alexander Duyck, kvm, mst, linux-kernel, linux-mm, akpm, mgorman,
	Minchan Kim, vbabka



On 24.01.20 14:23, Hillf Danton wrote:
> 
> On Thu, 23 Jan 2020 11:20:07 +0100 Alexander Graf wrote:
>>
>> The big problem I see is that what I really want from a user's point of
>> view is a tuneable that says "Automatically free clean page cache pages
>> that were not accessed in the last X minutes".
> 
> A diff is made on top of 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") without
> test in any form, assuming it goes in line with the tunable above but without
> "X minutes" taken into account.
> 
> [BTW, please take a look at
> Content-Type: text/plain; charset="utf-8"; format="flowed"
> Content-Transfer-Encoding: base64
Thanks, it looks like Exchange doesn't pass 8bit data on. I've changed
the default to ASCII now; please just notify me in private if you see it
broken again.

> 
> and ensure pure text message.]
> 
> 
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -69,6 +69,7 @@
>   
>   #define MADV_COLD	20		/* deactivate these pages */
>   #define MADV_PAGEOUT	21		/* reclaim these pages */
> +#define MADV_CCPC	22		/* reclaim cold & clean page cache pages */

This patch adds a new madvise flag. I have a hard time seeing how that 
would help with the "full system expiry" of pages?

The basic point that I tried to make above was that I would ideally like 
to have a coldness cutoff date at which you can be pretty confident that 
page cache data is no longer needed.

To work properly, this needs to be transparent to any normal process on 
the system :).
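
As a rough illustration of the "coldness cutoff" policy described above,
the selection rule could be sketched as follows. This is a userspace toy
model under stated assumptions, not kernel code: struct cached_page and
select_cold_clean() are hypothetical names, and the real page cache does
not track a per-page timestamp like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for per-page state; not a kernel structure. */
struct cached_page {
	time_t last_access;	/* last time the page was touched */
	bool dirty;		/* dirty pages would need writeback first */
};

/*
 * Mark in 'drop' every page that is clean and has not been accessed
 * within 'cutoff_secs' of 'now'. Returns the number of pages selected.
 */
size_t select_cold_clean(const struct cached_page *pages, size_t n,
			 time_t now, time_t cutoff_secs, bool *drop)
{
	size_t selected = 0;

	assert(pages != NULL && drop != NULL);

	for (size_t i = 0; i < n; i++) {
		bool cold = (now - pages[i].last_access) >= cutoff_secs;

		drop[i] = cold && !pages[i].dirty;
		if (drop[i])
			selected++;
	}
	return selected;
}
```

The point of the sketch is that the decision depends only on system-wide
state (age and cleanliness), so no process has to opt in through madvise.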


Alex








* Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting
  2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
                   ` (10 preceding siblings ...)
       [not found] ` <20200124132352.12824-1-hdanton@sina.com>
@ 2020-02-03 22:05 ` Alexander Duyck
  2020-02-10 19:18   ` Should I repost? (was: Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting) Alexander Duyck
  11 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2020-02-03 22:05 UTC (permalink / raw)
  To: akpm, mgorman, david
  Cc: yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	osalvador, vbabka, AlexanderDuyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm

On Wed, 2020-01-22 at 09:43 -0800, Alexander Duyck wrote:
> This series provides an asynchronous means of reporting free guest pages
> to a hypervisor so that the memory associated with those pages can be
> dropped and reused by other processes and/or guests on the host. Using
> this it is possible to avoid unnecessary I/O to disk and greatly improve
> performance in the case of memory overcommit on the host.

<snip>

> A brief history on the background of free page reporting can be found at:
> https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/
> 
> Changes from v14:
> https://lore.kernel.org/lkml/20191119214454.24996.66289.stgit@localhost.localdomain/
> Renamed "unused page reporting" to "free page reporting"
>   Updated code, kconfig, and patch descriptions
> Split out patch for __free_isolated_page
>   Renamed function to __putback_isolated_page
> Rewrote core reporting functionality
>   Added logic to reschedule worker in 2 seconds instead of run to completion
>   Removed reported_pages statistics
>   Removed REPORTING_REQUESTED bit used in zone flags
>   Replaced page_reporting_dev_info refcount with state variable
>   Removed scatterlist from page_reporting_dev_info
>   Removed capacity from page reporting device
>   Added dynamic scatterlist allocation/free at start/end of reporting process
>   Updated __free_one_page so that reported pages are not always added to tail
>   Added logic to handle error from report function
> Updated virtio-balloon patch that adds support for page reporting
>   Updated patch description to try and highlight differences in approaches
>   Updated logic to reflect that we cannot limit the scatterlist from device
>   Added logic to return error from report function
> Moved documentation patch to end of patch set
> 
> Changes from v15:
> https://lore.kernel.org/lkml/20191205161928.19548.41654.stgit@localhost.localdomain/
> Rebased on linux-next-20191219
> Split out patches for budget and moving head to last page processed
> Updated budget code to reduce how much memory is reported per pass
> Added logic to also rotate the list if we exit due to a page isolation failure
> Added migratetype as argument in __putback_isolated_page
> 
> Changes from v16:
> https://lore.kernel.org/lkml/20200103210509.29237.18426.stgit@localhost.localdomain/
> Rebased on linux-next-20200122
>   Updated patch 2 to account for removal of pr_info in __isolate_free_page
> Updated patch title for patches 7, 8, and 9 to use prefix mm/page_reporting
> No code changes other than conflict resolution for patch 2

So I thought I would put out a gentle nudge since it has been about 4
weeks since v16 was submitted, a little over a week and a half for v16.1,
and I have yet to get any feedback on the code contained in the patchset.
Codewise nothing has changed from the v16 patchset other than rebasing it
off of the linux-next tree to resolve some merge conflicts that I saw
recently, and discussion around v16.1 was mostly about next steps and how
to deal with the page cache instead of discussing the code itself.

The full patchset can be found at:
https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/

I believe I still need review feedback for patches 3, 4, 7, 8, and 9.

Thanks.

- Alex



* Should I repost? (was: Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting)
  2020-02-03 22:05 ` Alexander Duyck
@ 2020-02-10 19:18   ` Alexander Duyck
  2020-02-11 10:40     ` Mel Gorman
  0 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2020-02-10 19:18 UTC (permalink / raw)
  To: akpm, mgorman, david
  Cc: yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	osalvador, vbabka, AlexanderDuyck, kvm, mst, linux-kernel, willy,
	mhocko, linux-mm

On Mon, 2020-02-03 at 14:05 -0800, Alexander Duyck wrote:
> On Wed, 2020-01-22 at 09:43 -0800, Alexander Duyck wrote:
> > This series provides an asynchronous means of reporting free guest pages
> > to a hypervisor so that the memory associated with those pages can be
> > dropped and reused by other processes and/or guests on the host. Using
> > this it is possible to avoid unnecessary I/O to disk and greatly improve
> > performance in the case of memory overcommit on the host.
> 
> <snip>
> 
> > A brief history on the background of free page reporting can be found at:
> > https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/
> > 
> > Changes from v14:
> > https://lore.kernel.org/lkml/20191119214454.24996.66289.stgit@localhost.localdomain/
> > Renamed "unused page reporting" to "free page reporting"
> >   Updated code, kconfig, and patch descriptions
> > Split out patch for __free_isolated_page
> >   Renamed function to __putback_isolated_page
> > Rewrote core reporting functionality
> >   Added logic to reschedule worker in 2 seconds instead of run to completion
> >   Removed reported_pages statistics
> >   Removed REPORTING_REQUESTED bit used in zone flags
> >   Replaced page_reporting_dev_info refcount with state variable
> >   Removed scatterlist from page_reporting_dev_info
> >   Removed capacity from page reporting device
> >   Added dynamic scatterlist allocation/free at start/end of reporting process
> >   Updated __free_one_page so that reported pages are not always added to tail
> >   Added logic to handle error from report function
> > Updated virtio-balloon patch that adds support for page reporting
> >   Updated patch description to try and highlight differences in approaches
> >   Updated logic to reflect that we cannot limit the scatterlist from device
> >   Added logic to return error from report function
> > Moved documentation patch to end of patch set
> > 
> > Changes from v15:
> > https://lore.kernel.org/lkml/20191205161928.19548.41654.stgit@localhost.localdomain/
> > Rebased on linux-next-20191219
> > Split out patches for budget and moving head to last page processed
> > Updated budget code to reduce how much memory is reported per pass
> > Added logic to also rotate the list if we exit due to a page isolation failure
> > Added migratetype as argument in __putback_isolated_page
> > 
> > Changes from v16:
> > https://lore.kernel.org/lkml/20200103210509.29237.18426.stgit@localhost.localdomain/
> > Rebased on linux-next-20200122
> >   Updated patch 2 to account for removal of pr_info in __isolate_free_page
> > Updated patch title for patches 7, 8, and 9 to use prefix mm/page_reporting
> > No code changes other than conflict resolution for patch 2
> 
> So I thought I would put out a gentle nudge since it has been about 4
> weeks since v16 was submitted, a little over a week and a half for v16.1,
> and I have yet to get any feedback on the code contained in the patchset.
> Codewise nothing has changed from the v16 patchset other than rebasing it
> off of the linux-next tree to resolve some merge conflicts that I saw
> recently, and discussion around v16.1 was mostly about next steps and how
> to deal with the page cache instead of discussing the code itself.
> 
> The full patchset can be found at:
> https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/
> 
> I believe I still need review feedback for patches 3, 4, 7, 8, and 9.
> 
> Thanks.
> 
> - Alex

So I had posted this patch set a few days before Linus's merge window
opened. When I posted it the discussion was about what the follow-up on
this patch set will be in terms of putting pressure on the page cache to
force it to shrink. However I didn't get any review comments on the code
itself.

My last understanding on this patch set is that I am waiting on patch
feedback from Mel Gorman as he had the remaining requests that led to most
of the changes in v15 and v16. I believe I have addressed them, but I
don't believe he has had a chance to review them.

I am wondering now if it is still possible to either get it reviewed
and/or applied without reposting, or do I need to repost it since it has
been several weeks since I submitted it? The patch set still applies to
the linux-next tree without any issues.

Thanks.

- Alex






* Re: Should I repost? (was: Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting)
  2020-02-10 19:18   ` Should I repost? (was: Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting) Alexander Duyck
@ 2020-02-11 10:40     ` Mel Gorman
  2020-02-11 22:57       ` Alexander Duyck
  0 siblings, 1 reply; 39+ messages in thread
From: Mel Gorman @ 2020-02-11 10:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, david, yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, osalvador, vbabka, AlexanderDuyck, kvm, mst,
	linux-kernel, willy, mhocko, linux-mm

On Mon, Feb 10, 2020 at 11:18:59AM -0800, Alexander Duyck wrote:
> > So I thought I would put out a gentle nudge since it has been about 4
> > weeks since v16 was submitted, a little over a week and a half for v16.1,
> > and I have yet to get any feedback on the code contained in the patchset.
> > Codewise nothing has changed from the v16 patchset other than rebasing it
> > off of the linux-next tree to resolve some merge conflicts that I saw
> > recently, and discussion around v16.1 was mostly about next steps and how
> > to deal with the page cache instead of discussing the code itself.
> > 
> > The full patchset can be found at:
> > https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/
> > 
> > I believe I still need review feedback for patches 3, 4, 7, 8, and 9.
> > 
> > Thanks.
> > 
> > - Alex
> 
> So I had posted this patch set a few days before Linus's merge window
> opened. When I posted it the discussion was about what the follow-up on
> this patch set will be in terms of putting pressure on the page cache to
> force it to shrink. However I didn't get any review comments on the code
> itself.
> 
> My last understanding on this patch set is that I am waiting on patch
> feedback from Mel Gorman as he had the remaining requests that led to most
> of the changes in v15 and v16. I believe I have addressed them, but I
> don't believe he has had a chance to review them.
> 
> I am wondering now if it is still possible to either get it reviewed
> and/or applied without reposting, or do I need to repost it since it has
> been several weeks since I submitted it? The patch set still applies to
> the linux-next tree without any issues.
> 

Please repost so that this is confirmed to be working as expected after
the merge window and to not conflict with anything else that got merged
in the meantime. This fell off my radar because of the
timing when it was posted and the volume of mail I was receiving. I simply
noted a large amount of traffic in response to the series and assumed
others had issues that would get resolved without looking closely. Now
I see that it was all comments on future work instead of the series itself.

Sorry.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-01-22 17:43 ` [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host Alexander Duyck
@ 2020-02-11 11:03   ` David Hildenbrand
  2020-02-11 11:47     ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-02-11 11:03 UTC (permalink / raw)
  To: Alexander Duyck, kvm, mst, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka
  Cc: yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck, osalvador

On 22.01.20 18:43, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Add support for the page reporting feature provided by virtio-balloon.
> Reporting differs from the regular balloon functionality in that it is
> much less durable than a standard memory balloon. Instead of creating a
> list of pages that cannot be accessed the pages are only inaccessible
> while they are being indicated to the virtio interface. Once the
> interface has acknowledged them they are placed back into their respective
> free lists and are once again accessible by the guest system.
> 
> Unlike a standard balloon we don't inflate and deflate the pages. Instead
> we perform the reporting, and once the reporting is completed it is
> assumed that the page has been dropped from the guest and will be faulted
> back in the next time the page is accessed.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  drivers/virtio/Kconfig              |    1 +
>  drivers/virtio/virtio_balloon.c     |   64 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_balloon.h |    1 +
>  3 files changed, 66 insertions(+)
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 078615cf2afc..4b2dd8259ff5 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -58,6 +58,7 @@ config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
>  	select MEMORY_BALLOON
> +	select PAGE_REPORTING
>  	---help---
>  	 This driver supports increasing and decreasing the amount
>  	 of memory within a KVM guest.
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 40bb7693e3de..a07b9e18a292 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -19,6 +19,7 @@
>  #include <linux/mount.h>
>  #include <linux/magic.h>
>  #include <linux/pseudo_fs.h>
> +#include <linux/page_reporting.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -47,6 +48,7 @@ enum virtio_balloon_vq {
>  	VIRTIO_BALLOON_VQ_DEFLATE,
>  	VIRTIO_BALLOON_VQ_STATS,
>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
> +	VIRTIO_BALLOON_VQ_REPORTING,
>  	VIRTIO_BALLOON_VQ_MAX
>  };
>  
> @@ -114,6 +116,10 @@ struct virtio_balloon {
>  
>  	/* To register a shrinker to shrink memory upon memory pressure */
>  	struct shrinker shrinker;
> +
> +	/* Free page reporting device */
> +	struct virtqueue *reporting_vq;
> +	struct page_reporting_dev_info pr_dev_info;
>  };
>  
>  static struct virtio_device_id id_table[] = {
> @@ -153,6 +159,33 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  
>  }
>  
> +int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_info,
> +				   struct scatterlist *sg, unsigned int nents)
> +{
> +	struct virtio_balloon *vb =
> +		container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> +	struct virtqueue *vq = vb->reporting_vq;
> +	unsigned int unused, err;
> +
> +	/* We should always be able to add these buffers to an empty queue. */
> +	err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
> +
> +	/*
> +	 * In the extremely unlikely case that something has occurred and we
> +	 * are able to trigger an error we will simply display a warning
> +	 * and exit without actually processing the pages.
> +	 */
> +	if (WARN_ON_ONCE(err))
> +		return err;
> +
> +	virtqueue_kick(vq);
> +
> +	/* When host has read buffer, this completes via balloon_ack */
> +	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> +
> +	return 0;
> +}


Did you see the discussion regarding unifying handling of
inflate/deflate/free_page_hinting/free_page_reporting, requested by
Michael? I think free page reporting is special and shall be left alone.

VIRTIO_BALLOON_F_REPORTING is nothing but a more advanced inflate, right
(sg, inflate based on size - not "virtio pages")? And you rely on
deflates not being required before reusing an inflated page.

I suggest the following:

/* New interface (+ 2 virtqueues) to inflate/deflate using a SG */
VIRTIO_BALLOON_F_SG
/*
 * No need to deflate when reusing pages (once the inflate request was
 * processed). Applies to all inflate queues.
 */
VIRTIO_BALLOON_F_OPTIONAL_DEFLATE

And two new virtqueues

VIRTIO_BALLOON_VQ_INFLATE_SG
VIRTIO_BALLOON_VQ_DEFLATE_SG


Your feature would depend on VIRTIO_BALLOON_F_SG &&
VIRTIO_BALLOON_F_OPTIONAL_DEFLATE. VIRTIO_BALLOON_F_OPTIONAL_DEFLATE
could be reused to avoid deflating on certain events (e.g., from
OOM/shrinker).
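
The dependency described above could be checked roughly like this. This
is a minimal sketch: the two feature names come from the proposal in this
mail, but the bit positions (6 and 7) are purely hypothetical, and
has_feature() / can_report_free_pages() are illustrative helpers rather
than existing virtio functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical bit positions for the two proposed features; neither the
 * names nor the numbers are part of any virtio specification.
 */
#define VIRTIO_BALLOON_F_SG			6
#define VIRTIO_BALLOON_F_OPTIONAL_DEFLATE	7

/* Test a single bit in a negotiated 64-bit feature mask. */
bool has_feature(uint64_t features, unsigned int bit)
{
	assert(bit < 64);
	return (features >> bit) & 1;
}

/* Reporting needs SG-based inflate and must not require a deflate. */
bool can_report_free_pages(uint64_t features)
{
	return has_feature(features, VIRTIO_BALLOON_F_SG) &&
	       has_feature(features, VIRTIO_BALLOON_F_OPTIONAL_DEFLATE);
}
```

A device offering only VIRTIO_BALLOON_F_SG would still get SG-based
inflate/deflate, but reporting would stay disabled because reused pages
would have to be deflated first.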

Thoughts?

-- 
Thanks,

David / dhildenb



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 11:03   ` David Hildenbrand
@ 2020-02-11 11:47     ` Michael S. Tsirkin
  2020-02-11 12:19       ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2020-02-11 11:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, kvm, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, alexander.h.duyck, osalvador

On Tue, Feb 11, 2020 at 12:03:57PM +0100, David Hildenbrand wrote:
> On 22.01.20 18:43, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > 
> > Add support for the page reporting feature provided by virtio-balloon.
> > Reporting differs from the regular balloon functionality in that is is
> > much less durable than a standard memory balloon. Instead of creating a
> > list of pages that cannot be accessed the pages are only inaccessible
> > while they are being indicated to the virtio interface. Once the
> > interface has acknowledged them they are placed back into their respective
> > free lists and are once again accessible by the guest system.
> > 
> > Unlike a standard balloon we don't inflate and deflate the pages. Instead
> > we perform the reporting, and once the reporting is completed it is
> > assumed that the page has been dropped from the guest and will be faulted
> > back in the next time the page is accessed.
> > 
> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> >  drivers/virtio/Kconfig              |    1 +
> >  drivers/virtio/virtio_balloon.c     |   64 +++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_balloon.h |    1 +
> >  3 files changed, 66 insertions(+)
> > 
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 078615cf2afc..4b2dd8259ff5 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -58,6 +58,7 @@ config VIRTIO_BALLOON
> >  	tristate "Virtio balloon driver"
> >  	depends on VIRTIO
> >  	select MEMORY_BALLOON
> > +	select PAGE_REPORTING
> >  	---help---
> >  	 This driver supports increasing and decreasing the amount
> >  	 of memory within a KVM guest.
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 40bb7693e3de..a07b9e18a292 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/mount.h>
> >  #include <linux/magic.h>
> >  #include <linux/pseudo_fs.h>
> > +#include <linux/page_reporting.h>
> >  
> >  /*
> >   * Balloon device works in 4K page units.  So each page is pointed to by
> > @@ -47,6 +48,7 @@ enum virtio_balloon_vq {
> >  	VIRTIO_BALLOON_VQ_DEFLATE,
> >  	VIRTIO_BALLOON_VQ_STATS,
> >  	VIRTIO_BALLOON_VQ_FREE_PAGE,
> > +	VIRTIO_BALLOON_VQ_REPORTING,
> >  	VIRTIO_BALLOON_VQ_MAX
> >  };
> >  
> > @@ -114,6 +116,10 @@ struct virtio_balloon {
> >  
> >  	/* To register a shrinker to shrink memory upon memory pressure */
> >  	struct shrinker shrinker;
> > +
> > +	/* Free page reporting device */
> > +	struct virtqueue *reporting_vq;
> > +	struct page_reporting_dev_info pr_dev_info;
> >  };
> >  
> >  static struct virtio_device_id id_table[] = {
> > @@ -153,6 +159,33 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> >  
> >  }
> >  
> > +int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_info,
> > +				   struct scatterlist *sg, unsigned int nents)
> > +{
> > +	struct virtio_balloon *vb =
> > +		container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> > +	struct virtqueue *vq = vb->reporting_vq;
> > +	unsigned int unused, err;
> > +
> > +	/* We should always be able to add these buffers to an empty queue. */
> > +	err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
> > +
> > +	/*
> > +	 * In the extremely unlikely case that something has occurred and we
> > +	 * are able to trigger an error we will simply display a warning
> > +	 * and exit without actually processing the pages.
> > +	 */
> > +	if (WARN_ON_ONCE(err))
> > +		return err;
> > +
> > +	virtqueue_kick(vq);
> > +
> > +	/* When host has read buffer, this completes via balloon_ack */
> > +	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> > +
> > +	return 0;
> > +}
> 
> 
> Did you see the discussion regarding unifying handling of
> inflate/deflate/free_page_hinting_free_page_reporting, requested by
> Michael? I think free page reporting is special and shall be left alone.

Not sure what you mean by "left alone" here. Could you clarify?

> VIRTIO_BALLOON_F_REPORTING is nothing but a more advanced inflate, right
> (sg, inflate based on size - not "virtio pages")?


Not exactly - it's also initiated by the guest as opposed to the host, and
not guided by the balloon size request set by the host.
It also uses a dedicated queue to avoid blocking other functionality ...

I really think this is more like an inflate immediately followed by deflate.



> And you rely on
> deflates not being required before reusing an inflated page.
> 
> I suggest the following:
> 
> /* New interface (+ 2 virtqueues) to inflate/deflate using a SG */
> VIRTIO_BALLOON_F_SG
> /*
>  * No need to deflate when reusing pages (once the inflate request was
>  * processed). Applies to all inflate queues.
>  */
> VIRTIO_BALLOON_F_OPTIONAL_DEFLATE
> 
> And two new virtqueues
> 
> VIRTIO_BALLOON_VQ_INFLATE_SG
> VIRTIO_BALLOON_VQ_DEFLATE_SG
> 
> 
> Your feature would depend on VIRTIO_BALLOON_F_SG &&
> VIRTIO_BALLOON_F_OPTIONAL_DEFLATE. VIRTIO_BALLOON_F_OPTIONAL_DEFLATE
> could be reused to avoid deflating on certain events (e.g., from
> OOM/shrinker).
> 
> Thoughts?

I'd rather wait until we have a usecase and preferably a POC
showing it helps before we add optional deflate ...
For now I personally am fine with just making this go ahead as is,
and imply SG and OPTIONAL_DEFLATE just for this VQ.

Do you feel strongly we need to bring this up to a TC vote?
It means spec patch needs to be written, but it
does not have to be a big patch ...


> -- 
> Thanks,
> 
> David / dhildenb



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 11:47     ` Michael S. Tsirkin
@ 2020-02-11 12:19       ` David Hildenbrand
  2020-02-11 14:07         ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-02-11 12:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, kvm, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, alexander.h.duyck, osalvador

>>
>> Did you see the discussion regarding unifying handling of
>> inflate/deflate/free_page_hinting_free_page_reporting, requested by
>> Michael? I think free page reporting is special and shall be left alone.
> 
> Not sure what do you mean by "left alone here". Could you clarify?

Don't try to unify handling like I proposed below, because its
semantics are special.

> 
>> VIRTIO_BALLOON_F_REPORTING is nothing but a more advanced inflate, right
>> (sg, inflate based on size - not "virtio pages")?
> 
> 
> Not exactly - it's also initiated by guest as opposed to host, and
> not guided by the ballon size request set by the host.

True, but AFAIKS you could use existing INFLATE/DEFLATE in a similar
way. There is no way for the hypervisor to nack a request. The balloon
size is not glued to inflate/deflate requests. The guest manually
updates it.

> And uses a dedicated queue to avoid blocking other functionality ...

True, but the other queues also don't allow for an easy extension
AFAIKS, so that's another reason.

> 
> I really think this is more like an inflate immediately followed by deflate.

Depends on how you look at it. As inflate/deflate is not glued to the
balloon size (the guest updates the size manually), it's not obvious.

E.g., in QEMU, a deflate is just a performance improvement
("MADV_WILLNEED") - in that regard, it's more like an optional deflation.
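
To make the madvise() semantics concrete, here is a minimal userspace
sketch. It assumes guest memory is backed by private anonymous memory on
a Linux host; dontneed_zeroes_page() is a name made up for this example
and is not QEMU code:

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if a dirtied private anonymous page reads back as zero
 * after MADV_DONTNEED, 1 if it does not, and -1 on a syscall error.
 */
static int dontneed_zeroes_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int ret;

	if (page_size <= 0)
		return -1;

	p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, (size_t)page_size);	/* dirty the page */

	/* "Inflate": the contents are no longer needed and may be dropped. */
	if (madvise(p, (size_t)page_size, MADV_DONTNEED)) {
		munmap(p, (size_t)page_size);
		return -1;
	}

	/* "Deflate": only a hint, so its result is deliberately ignored. */
	(void)madvise(p, (size_t)page_size, MADV_WILLNEED);

	/* The next access faults in a fresh zero-filled page. */
	ret = (p[0] == 0 && p[page_size - 1] == 0) ? 0 : 1;

	munmap(p, (size_t)page_size);
	return ret;
}
```

The page is usable again whether or not the MADV_WILLNEED call is made,
which is why the deflate step looks optional from this angle.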

[...]

> 
> I'd rather wait until we have a usecase and preferably a POC
> showing it helps before we add optional deflate ...
> For now I personally am fine with just making this go ahead as is,
> and imply SG and OPTIONAL_DEFLATE just for this VQ.

Also fine with me. You asked whether we can abstract any of this, if I
am not wrong :) So this was my take.

> 
> Do you feel strongly we need to bring this up to a TC vote?

Not really. People were asking about how to inflate/deflate huge
pages a long time ago (comes with different challenges - e.g., balloon
compaction). It looked like this interface could have been reused for
this as well.

But yeah, I am not a fan of virtio-balloon and the whole inflate/deflate
thingy. So at least I don't see a need to extend the inflate/deflate
capability.

Free page reporting is a different story (and the semantics require no
inflate/deflate/balloon size) - it could have been moved to
virtio-whatever without any issues. So I am fine with this.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 12:19       ` David Hildenbrand
@ 2020-02-11 14:07         ` Michael S. Tsirkin
  2020-02-11 14:31           ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2020-02-11 14:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, kvm, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, alexander.h.duyck, osalvador

On Tue, Feb 11, 2020 at 01:19:31PM +0100, David Hildenbrand wrote:
> >>
> >> Did you see the discussion regarding unifying handling of
> >> inflate/deflate/free_page_hinting_free_page_reporting, requested by
> >> Michael? I think free page reporting is special and shall be left alone.
> > 
> > Not sure what do you mean by "left alone here". Could you clarify?
> 
> Don't try to unify handling like I proposed below, because it's
> semantics are special.
> 
> > 
> >> VIRTIO_BALLOON_F_REPORTING is nothing but a more advanced inflate, right
> >> (sg, inflate based on size - not "virtio pages")?
> > 
> > 
> > Not exactly - it's also initiated by guest as opposed to host, and
> > not guided by the ballon size request set by the host.
> 
> True, but AFAIKS you could use existing INFLATE/DEFLATE in a similar
> way. There is no way for the hypervisor to nack a request. The balloon
> size is not glued to inflate/deflate requests. The guests manually
> updates it.

Hmm how isn't it? num_pages is the only way to inflate/deflate.

Spec also says:
The device is driven either by the receipt of a configuration change notification, or by changing guest memory
needs, such as performing memory compaction or responding to out of memory conditions.

so ignoring compaction/oom (the latter is under-specified, not a good
example to follow) yes inflate/deflate are tied to host specified
configuration.


> > And uses a dedicated queue to avoid blocking other functionality ...
> 
> True, but the other queues also don't allow for an easy extension
> AFAIKS, so that's another reason.
> 
> > 
> > I really think this is more like an inflate immediately followed by deflate.
> 
> Depends on how you look at it. As inflate/deflate is not glued to the
> balloon size (the guest updates the size manually), it's not obvious.
> 
> E.g., in QEMU, a deflate is just a performance improvement
> ("MADV_WILLNEED") - in that regard, it's more like an optional deflation.
> 
> [...]
> 
> > 
> > I'd rather wait until we have a usecase and preferably a POC
> > showing it helps before we add optional deflate ...
> > For now I personally am fine with just making this go ahead as is,
> > and imply SG and OPTIONAL_DEFLATE just for this VQ.
> 
> Also fine with me, you asked about if we can abstract any of this if I
> am not wrong :) So this was my take.
> 
> > 
> > Do you feel strongly we need to bring this up to a TC vote?
> 
> Not really. People have been asking about how to inflate/deflate huge
> pages a long time ago (comes with different challenges - e.g., balloon
> compaction). looked like this interface could have been reused for this
> as well.
> 
> But yeah, I am not a fan of virtio-balloon and the whole inflate/deflate
> thingy. So at least I don't see a need to extend the inflate/deflate
> capability.
> 
> Free page reporting is a different story (and the semantics require no
> inflate/deflate/balloon size) - it could have been moved to
> virtio-whatever without any issues. So I am fine with this.
> 
> -- 
> Thanks,
> 
> David / dhildenb



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 14:07         ` Michael S. Tsirkin
@ 2020-02-11 14:31           ` David Hildenbrand
  2020-02-11 14:48             ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-02-11 14:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, kvm, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, alexander.h.duyck, osalvador

On 11.02.20 15:07, Michael S. Tsirkin wrote:
> On Tue, Feb 11, 2020 at 01:19:31PM +0100, David Hildenbrand wrote:
>>>>
>>>> Did you see the discussion regarding unifying handling of
>>>> inflate/deflate/free_page_hinting_free_page_reporting, requested by
>>>> Michael? I think free page reporting is special and shall be left alone.
>>>
>>> Not sure what do you mean by "left alone here". Could you clarify?
>>
>> Don't try to unify handling like I proposed below, because it's
>> semantics are special.
>>
>>>
>>>> VIRTIO_BALLOON_F_REPORTING is nothing but a more advanced inflate, right
>>>> (sg, inflate based on size - not "virtio pages")?
>>>
>>>
>>> Not exactly - it's also initiated by guest as opposed to host, and
>>> not guided by the ballon size request set by the host.
>>
>> True, but AFAIKS you could use existing INFLATE/DEFLATE in a similar
>> way. There is no way for the hypervisor to nack a request. The balloon
>> size is not glued to inflate/deflate requests. The guests manually
>> updates it.
> 
> Hmm how isn't it? num_pages is the only way to inflate/deflate.

Usually, guests are nice and respond to num_pages changes in an
appropriate way, except:
- Triggering deflate: Unload the driver. Suspend/hibernate. OOM.
  (+ Reboot, although that's special)
- Triggering inflate + deflate: Simple balloon compaction / page
  migration.

But that's not what I meant.

"actual" is updated by the guest, not by the host. So the "actual
balloon size" is set by the guest. It's not glued to inflation/deflation
requests. "num_pages" is the host request.

AFAIKS, the guest could inflate/deflate (esp. temporarily) and
communicate via "actual" the actual balloon size as it sees it.

> Spec also says:
> The device is driven either by the receipt of a configuration change notification, or by changing guest memory
> needs, such as performing memory compaction or responding to out of memory conditions.
> 
> so ignoring compaction/oom (later is under-specified, not a good example
> to follow) yes inflate/deflate are tied to host specified configuration

Yes, "num_pages" is the host request. But I'd say the statement (esp.
"the device is driven by") in the spec is rather weak. It does not
explicitly state when inflation/deflation is allowed IMHO.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 14:31           ` David Hildenbrand
@ 2020-02-11 14:48             ` Michael S. Tsirkin
  2020-02-11 15:13               ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2020-02-11 14:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, kvm, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, alexander.h.duyck, osalvador

On Tue, Feb 11, 2020 at 03:31:18PM +0100, David Hildenbrand wrote:
> On 11.02.20 15:07, Michael S. Tsirkin wrote:
> > On Tue, Feb 11, 2020 at 01:19:31PM +0100, David Hildenbrand wrote:
> >>>>
> >>>> Did you see the discussion regarding unifying handling of
> >>>> inflate/deflate/free_page_hinting_free_page_reporting, requested by
> >>>> Michael? I think free page reporting is special and shall be left alone.
> >>>
> >>> Not sure what do you mean by "left alone here". Could you clarify?
> >>
> >> Don't try to unify handling like I proposed below, because it's
> >> semantics are special.
> >>
> >>>
> >>>> VIRTIO_BALLOON_F_REPORTING is nothing but a more advanced inflate, right
> >>>> (sg, inflate based on size - not "virtio pages")?
> >>>
> >>>
> >>> Not exactly - it's also initiated by guest as opposed to host, and
> >>> not guided by the ballon size request set by the host.
> >>
> >> True, but AFAIKS you could use existing INFLATE/DEFLATE in a similar
> >> way. There is no way for the hypervisor to nack a request. The balloon
> >> size is not glued to inflate/deflate requests. The guests manually
> >> updates it.
> > 
> > Hmm how isn't it? num_pages is the only way to inflate/deflate.
> 
> Usually, guests are nice and respond to num_pages changes in an
> appropriate way, except:
> - Triggering deflate: Unload the driver. Suspend/hibernate. OOM.
>   (+ Reboot, although that's special)
> - Triggering inflate + deflate: Simple balloon compaction / page
>   migration.


These are all real situations but balloon always has been best effort.


> But that's not what I meant.
> 
> "actual" is updated by the guest, not by the host. So the "actual
> balloon size" is set by the guest. It's not glued to inflation/deflation
> requests. "num_pages" is the host request.

Well the expectation is that as long as guest has ample
available memory, when num_pages changes then
guest starts sending inflate/deflate requests,
until actual matches num_pages.

If it does not match, and we wait and it still doesn't,
then something unusual happened. People do depend on that
behaviour.
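
A minimal sketch of that expectation, written as a userspace model
rather than driver code. struct balloon_cfg_model and balloon_step() are
names made up for illustration; they only mirror the num_pages and
actual fields of the balloon config space:

```c
#include <stdint.h>

/* Userspace model of the two balloon config fields discussed here. */
struct balloon_cfg_model {
	uint32_t num_pages;	/* written by the host: requested size */
	uint32_t actual;	/* written by the guest: current size */
};

/*
 * One best-effort step: inflate or deflate by at most 'batch' pages
 * towards the host target, then publish the new size in 'actual'.
 * Returns the number of pages still outstanding.
 */
static uint32_t balloon_step(struct balloon_cfg_model *cfg, uint32_t batch)
{
	if (cfg->actual < cfg->num_pages) {
		uint32_t diff = cfg->num_pages - cfg->actual;

		cfg->actual += diff < batch ? diff : batch;	/* inflate */
		return cfg->num_pages - cfg->actual;
	}
	if (cfg->actual > cfg->num_pages) {
		uint32_t diff = cfg->actual - cfg->num_pages;

		cfg->actual -= diff < batch ? diff : batch;	/* deflate */
		return cfg->actual - cfg->num_pages;
	}
	return 0;
}
```

If repeated steps never bring the return value down to 0, that is the
"something unusual happened" case.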

> AFAIKs, the guest could inflate/deflate (esp. temporarily) and
> communicate via "actual" the actual balloon size as he sees it.

OK so you want hinted but unused pages counted, and reported
in "actual"? That's a vmexit before each page use ...



> > Spec also says:
> > The device is driven either by the receipt of a configuration change notification, or by changing guest memory
> > needs, such as performing memory compaction or responding to out of memory conditions.
> > 
> > so ignoring compaction/oom (later is under-specified, not a good example
> > to follow) yes inflate/deflate are tied to host specified configuration
> Yes, "num_pages" is the host request. But I'd say the statement (esp.
> "the device is driven by") in the spec is rather weak. It does not
> explicitly state when inflation/deflation is allowed IMHO.

Right since it's all best effort anyway.


> -- 
> Thanks,
> 
> David / dhildenb



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 14:48             ` Michael S. Tsirkin
@ 2020-02-11 15:13               ` David Hildenbrand
  2020-02-11 16:33                 ` Alexander Duyck
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2020-02-11 15:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, kvm, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, alexander.h.duyck, osalvador

>> AFAIKs, the guest could inflate/deflate (esp. temporarily) and
>> communicate via "actual" the actual balloon size as he sees it.
> 
> OK so you want hinted but unused pages counted, and reported
> in "actual"? That's a vmexit before each page use ...

No, not at all. I rather meant, that it is unclear how
inflation/deflation requests and "actual" *could* interact. Especially
if we would consider free page reporting as some way of inflation
(+immediate deflation) triggered by the guest. IMHO, we would not touch
"actual" in that case.

But as I said, I am totally fine with keeping it as is in this patch.
IOW not glue free page reporting to inflation/deflation but let it act
like something different with its own semantics (and document these
properly).

-- 
Thanks,

David / dhildenb



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 15:13               ` David Hildenbrand
@ 2020-02-11 16:33                 ` Alexander Duyck
  2020-02-11 17:04                   ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2020-02-11 16:33 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: Alexander Duyck, kvm, linux-kernel, willy, mhocko, linux-mm,
	akpm, mgorman, vbabka, yang.zhang.wz, nitesh, konrad.wilk,
	pagupta, riel, lcapitulino, dave.hansen, wei.w.wang, aarcange,
	pbonzini, dan.j.williams, osalvador

On Tue, 2020-02-11 at 16:13 +0100, David Hildenbrand wrote:
> > > AFAIKs, the guest could inflate/deflate (esp. temporarily) and
> > > communicate via "actual" the actual balloon size as he sees it.
> > 
> > OK so you want hinted but unused pages counted, and reported
> > in "actual"? That's a vmexit before each page use ...
> 
> No, not at all. I rather meant, that it is unclear how
> inflation/deflation requests and "actual" *could* interact. Especially
> if we would consider free page reporting as some way of inflation
> (+immediate deflation) triggered by the guest. IMHO, we would not touch
> "actual" in that case.
> 
> But as I said, I am totally fine with keeping it as is in this patch.
> IOW not glue free page reporting to inflation/deflation but let it act
> like something different with its own semantics (and document these
> properly).
> 

Okay, so before I post v17 am I leaving the virtio-balloon changes as they
were then?

For what it is worth I agree with Michael that there is more to this than
just a scatter-gather queue. For now I am trying to keep the overall
impact on QEMU on the smaller side, and if we do end up supporting the
MADV_FREE instead of MADV_DONTNEED that would also have an impact on
things as it would be yet another difference between ballooning and
hinting.

Thanks.

- Alex



* Re: [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host
  2020-02-11 16:33                 ` Alexander Duyck
@ 2020-02-11 17:04                   ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2020-02-11 17:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Michael S. Tsirkin, Alexander Duyck, kvm,
	linux-kernel, willy, mhocko, linux-mm, akpm, mgorman, vbabka,
	yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel, lcapitulino,
	dave.hansen, wei.w.wang, aarcange, pbonzini, dan.j.williams,
	osalvador



> On 11.02.2020 at 17:33, Alexander Duyck <alexander.h.duyck@linux.intel.com> wrote:
> 
> On Tue, 2020-02-11 at 16:13 +0100, David Hildenbrand wrote:
>>>> AFAIKs, the guest could inflate/deflate (esp. temporarily) and
>>>> communicate via "actual" the actual balloon size as he sees it.
>>> 
>>> OK so you want hinted but unused pages counted, and reported
>>> in "actual"? That's a vmexit before each page use ...
>> 
>> No, not at all. I rather meant, that it is unclear how
>> inflation/deflation requests and "actual" *could* interact. Especially
>> if we would consider free page reporting as some way of inflation
>> (+immediate deflation) triggered by the guest. IMHO, we would not touch
>> "actual" in that case.
>> 
>> But as I said, I am totally fine with keeping it as is in this patch.
>> IOW not glue free page reporting to inflation/deflation but let it act
>> like something different with its own semantics (and document these
>> properly).
>> 
> 
> Okay, so before I post v17 am I leaving the virtio-balloon changes as they
> were then?

I'd say yes :)


* Re: Should I repost? (was: Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting)
  2020-02-11 10:40     ` Mel Gorman
@ 2020-02-11 22:57       ` Alexander Duyck
  0 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2020-02-11 22:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, david, yang.zhang.wz, nitesh, konrad.wilk, pagupta, riel,
	lcapitulino, dave.hansen, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, osalvador, vbabka, AlexanderDuyck, kvm, mst,
	linux-kernel, willy, mhocko, linux-mm

On Tue, 2020-02-11 at 10:40 +0000, Mel Gorman wrote:
> On Mon, Feb 10, 2020 at 11:18:59AM -0800, Alexander Duyck wrote:
> > > So I thought I would put out a gentle nudge since it has been about 4
> > > weeks since v16 was submitted, a little over a week and a half for v16.1,
> > > and I have yet to get any feedback on the code contained in the patchset.
> > > Codewise nothing has changed from the v16 patchset other than rebasing it
> > > off of the linux-next tree to resolve some merge conflicts that I saw
> > > recently, and discussion around v16.1 was mostly about next steps and how
> > > to deal with the page cache instead of discussing the code itself.
> > > 
> > > The full patchset can be found at:
> > > https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/
> > > 
> > > I believe I still need review feedback for patches 3, 4, 7, 8, and 9.
> > > 
> > > Thanks.
> > > 
> > > - Alex
> > 
> > So I had posted this patch set a few days before Linus's merge window
> > opened. When I posted it the discussion was about what the follow-up on
> > this patch set will be in terms of putting pressure on the page cache to
> > force it to shrink. However I didn't get any review comments on the code
> > itself.
> > 
> > My last understanding on this patch set is that I am waiting on patch
> > feedback from Mel Gorman as he had the remaining requests that led to most
> > of the changes in v15 and v16. I believe I have addressed them, but I
> > don't believe he has had a chance to review them.
> > 
> > I am wondering now if it is still possible to either get it reviewed
> > and/or applied without reposting, or do I need to repost it since it has
> > been several weeks since I submitted it? The patch set still applies to
> > the linux-next tree without any issues.
> > 
> 
> Please repost to take into account that this is confirmed to be working
> as expected after the merge window and has not conflicted with anything
> else that got merged in the meantime. This fell off my radar because of the
> timing when it was posted and the volume of mail I was receiving. I simply
> noted a large amount of traffic in response to the series and assumed
> others had issues that would get resolved without looking closely. Now
> I see that it was all comments on future work instead of the series itself.
> 
> Sorry.
> 

No problem.

I have reposted as v17. I made a slight tweak to the cover page, rebased
on today's linux-next and QEMU, rebuilt and reran some of the tests to
verify the functionality and performance are still running about the same.

The full patch set with QEMU patches can be found here:
https://lore.kernel.org/lkml/20200211224416.29318.44077.stgit@localhost.localdomain/

Thanks.

- Alex



end of thread, other threads:[~2020-02-11 22:57 UTC | newest]

Thread overview: 39+ messages
2020-01-22 17:43 [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Duyck
2020-01-22 17:43 ` [PATCH v16.1 1/9] mm: Adjust shuffle code to allow for future coalescing Alexander Duyck
2020-01-22 17:43 ` [PATCH v16.1 2/9] mm: Use zone and order instead of free area in free_list manipulators Alexander Duyck
2020-01-22 17:43 ` [PATCH v16.1 3/9] mm: Add function __putback_isolated_page Alexander Duyck
2020-01-22 17:43 ` [PATCH v16.1 4/9] mm: Introduce Reported pages Alexander Duyck
2020-01-22 17:43 ` [PATCH v16.1 5/9] virtio-balloon: Pull page poisoning config out of free page hinting Alexander Duyck
2020-01-22 17:43 ` [PATCH v16.1 6/9] virtio-balloon: Add support for providing free page reports to host Alexander Duyck
2020-02-11 11:03   ` David Hildenbrand
2020-02-11 11:47     ` Michael S. Tsirkin
2020-02-11 12:19       ` David Hildenbrand
2020-02-11 14:07         ` Michael S. Tsirkin
2020-02-11 14:31           ` David Hildenbrand
2020-02-11 14:48             ` Michael S. Tsirkin
2020-02-11 15:13               ` David Hildenbrand
2020-02-11 16:33                 ` Alexander Duyck
2020-02-11 17:04                   ` David Hildenbrand
2020-01-22 17:43 ` [PATCH v16.1 7/9] mm/page_reporting: Rotate reported pages to the tail of the list Alexander Duyck
2020-01-22 17:43 ` [PATCH v16.1 8/9] mm/page_reporting: Add budget limit on how many pages can be reported per pass Alexander Duyck
2020-01-22 17:44 ` [PATCH v16.1 9/9] mm/page_reporting: Add free page reporting documentation Alexander Duyck
2020-01-23 10:20 ` [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting Alexander Graf
2020-01-23 14:05   ` David Hildenbrand
2020-01-23 14:52     ` Alexander Graf
2020-01-24 13:25       ` David Hildenbrand
2020-01-24 16:20         ` David Hildenbrand
2020-01-23 16:26   ` Alexander Duyck
2020-01-23 16:54     ` Alexander Graf
2020-01-23 18:33       ` Alexander Duyck
2020-01-23 18:47         ` Graf (AWS), Alexander
2020-01-23 22:05           ` Alexander Duyck
2020-01-23 17:20     ` Dave Hansen
2020-01-23 19:23       ` Konrad Rzeszutek Wilk
2020-01-23 19:17     ` Johannes Weiner
2020-01-23 22:29       ` Alexander Duyck
2020-01-23 23:24         ` Dave Hansen
     [not found] ` <20200124132352.12824-1-hdanton@sina.com>
2020-01-24 16:40   ` Alexander Graf
2020-02-03 22:05 ` Alexander Duyck
2020-02-10 19:18   ` Should I repost? (was: Re: [PATCH v16.1 0/9] mm / virtio: Provide support for free page reporting) Alexander Duyck
2020-02-11 10:40     ` Mel Gorman
2020-02-11 22:57       ` Alexander Duyck
