kvm.vger.kernel.org archive mirror
* [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment
@ 2019-05-30 21:53 Alexander Duyck
  2019-05-30 21:53 ` [RFC PATCH 01/11] mm: Move MAX_ORDER definition closer to pageblock_order Alexander Duyck
                   ` (14 more replies)
  0 siblings, 15 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:53 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

This series provides an asynchronous means of hinting to a hypervisor
that a guest page is no longer in use and can have the data associated
with it dropped. To do this I have implemented functionality that allows
for what I am referring to as "waste page treatment".

I have based many of the terms and much of the functionality on waste water
treatment. The similarity occurred to me after I had started referring to
the hints as "bubbles": the hints use the same approach as the balloon
functionality, but disappear if they are touched. As a result I started to
think of the virtio device as an aerator. The general idea with all of this
is that the guest should be treating the unused pages so that when they end
up heading "downstream", either to another guest or back to the host, they
will not need to be written to swap.

For a bit of background, the treatment process is based on a sequencing
batch reactor (SBR)[1]. The treatment process itself has five stages. The
first stage is fill: we take the raw pages and add them to the reactor. The
second stage is react: we hand the pages off to the Virtio Balloon driver to
have hints attached to them and for those hints to be sent to the
hypervisor. The third stage is settle: we wait for the hypervisor to
process the pages, and we should receive an interrupt when it is completed.
The fourth stage is decant: we drain the reactor of pages. Finally we have
the idle stage, which we will go into if the reference count for the reactor
gets down to 0 after a drain, or if a fill operation fails to obtain any
pages and the reference count has hit 0. Otherwise we return to the first
stage and start the cycle over again.
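
As a rough illustration of the cycle described above, here is a minimal
userspace sketch of the five stages as a state machine. All names in it
(sbr_stage, sbr_reactor, sbr_next) are hypothetical and are not taken from
the patches. The behaviour when a fill obtains no pages while the reference
count is still non-zero is not spelled out above, so retrying the fill is
an assumption.

```c
#include <stddef.h>

/* Hypothetical names for the five stages described in the cover letter. */
enum sbr_stage {
	SBR_FILL,	/* pull raw pages into the reactor */
	SBR_REACT,	/* attach hints and send them to the hypervisor */
	SBR_SETTLE,	/* wait for the hypervisor's completion interrupt */
	SBR_DECANT,	/* drain the pages back out of the reactor */
	SBR_IDLE,	/* nothing left to treat */
};

struct sbr_reactor {
	enum sbr_stage stage;
	size_t nr_pages;	/* pages currently held in the reactor */
	size_t refcount;	/* outstanding requests for treatment */
};

/*
 * Advance the reactor by one stage and return the new stage.
 * @filled is only consulted in the fill stage and is the number of raw
 * pages that the fill managed to obtain.
 */
static enum sbr_stage sbr_next(struct sbr_reactor *r, size_t filled)
{
	switch (r->stage) {
	case SBR_IDLE:
		/* Woken up again: start a new cycle. */
		r->stage = SBR_FILL;
		break;
	case SBR_FILL:
		r->nr_pages = filled;
		if (filled)
			r->stage = SBR_REACT;
		else if (!r->refcount)
			r->stage = SBR_IDLE;
		/* Assumption: otherwise stay in FILL and retry later. */
		break;
	case SBR_REACT:
		r->stage = SBR_SETTLE;
		break;
	case SBR_SETTLE:
		r->stage = SBR_DECANT;
		break;
	case SBR_DECANT:
		r->nr_pages = 0;
		r->stage = r->refcount ? SBR_FILL : SBR_IDLE;
		break;
	}
	return r->stage;
}
```

A caller would invoke sbr_next() once per event (fill done, hints queued,
interrupt received, drain done) and manage refcount itself.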

This patch set is still far more intrusive than I would really like for
what it has to do. Currently I am splitting nr_free into two values and
having to add a pointer and an index to track where we are in the treatment
process for a given free_area. I'm also not sure I have covered all
possible corner cases where pages can get into the free_area or move from
one migratetype to another.

Also I am still leaving a number of things hard-coded such as limiting the
lowest order processed to PAGEBLOCK_ORDER, and have left it up to the
guest to determine what size of reactor it wants to allocate to process
the hints.

Another consideration I am still debating is whether I really want to run
the aerator_cycle() function in interrupt context or whether I should have
it running in a thread somewhere else.

[1]: https://en.wikipedia.org/wiki/Sequencing_batch_reactor

---

Alexander Duyck (11):
      mm: Move MAX_ORDER definition closer to pageblock_order
      mm: Adjust shuffle code to allow for future coalescing
      mm: Add support for Treated Buddy pages
      mm: Split nr_free into nr_free_raw and nr_free_treated
      mm: Propogate Treated bit when splitting
      mm: Add membrane to free area to use as divider between treated and raw pages
      mm: Add support for acquiring first free "raw" or "untreated" page in zone
      mm: Add support for creating memory aeration
      mm: Count isolated pages as "treated"
      virtio-balloon: Add support for aerating memory via bubble hinting
      mm: Add free page notification hook


 arch/x86/include/asm/page.h         |   11 +
 drivers/virtio/Kconfig              |    1 
 drivers/virtio/virtio_balloon.c     |   89 ++++++++++
 include/linux/gfp.h                 |   10 +
 include/linux/memory_aeration.h     |   54 ++++++
 include/linux/mmzone.h              |  100 +++++++++--
 include/linux/page-flags.h          |   32 +++
 include/linux/pageblock-flags.h     |    8 +
 include/uapi/linux/virtio_balloon.h |    1 
 mm/Kconfig                          |    5 +
 mm/Makefile                         |    1 
 mm/aeration.c                       |  324 +++++++++++++++++++++++++++++++++++
 mm/compaction.c                     |    4 
 mm/page_alloc.c                     |  220 ++++++++++++++++++++----
 mm/shuffle.c                        |   24 ---
 mm/shuffle.h                        |   35 ++++
 mm/vmstat.c                         |    5 -
 17 files changed, 838 insertions(+), 86 deletions(-)
 create mode 100644 include/linux/memory_aeration.h
 create mode 100644 mm/aeration.c

--


* [RFC PATCH 01/11] mm: Move MAX_ORDER definition closer to pageblock_order
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
@ 2019-05-30 21:53 ` Alexander Duyck
  2019-05-30 21:53 ` [RFC PATCH 02/11] mm: Adjust shuffle code to allow for future coalescing Alexander Duyck
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:53 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Having the definition of MAX_ORDER in mmzone.h is problematic for code that
only wants access to things like pageblock_order: on some architectures
pageblock_order is defined in terms of MAX_ORDER, and MAX_ORDER is not
available from pageblock-flags.h.

Move the definition of MAX_ORDER into pageblock-flags.h so that it is
defined in the same header as pageblock_order. By doing this we don't need
to also include mmzone.h. The definition of MAX_ORDER will still be
accessible to any file that includes mmzone.h as it includes
pageblock-flags.h.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h          |    8 --------
 include/linux/pageblock-flags.h |    8 ++++++++
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 70394cabaf4e..a6bdff538437 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -22,14 +22,6 @@
 #include <linux/page-flags.h>
 #include <asm/page.h>
 
-/* Free memory management - zoned buddy allocator.  */
-#ifndef CONFIG_FORCE_MAX_ZONEORDER
-#define MAX_ORDER 11
-#else
-#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
-#endif
-#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
-
 /*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
  * costly to service.  That is between allocation orders which should
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 06a66327333d..e9e8006ccae1 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -40,6 +40,14 @@ enum pageblock_bits {
 	NR_PAGEBLOCK_BITS
 };
 
+/* Free memory management - zoned buddy allocator.  */
+#ifndef CONFIG_FORCE_MAX_ZONEORDER
+#define MAX_ORDER 11
+#else
+#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
+#endif
+#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE



* [RFC PATCH 02/11] mm: Adjust shuffle code to allow for future coalescing
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
  2019-05-30 21:53 ` [RFC PATCH 01/11] mm: Move MAX_ORDER definition closer to pageblock_order Alexander Duyck
@ 2019-05-30 21:53 ` Alexander Duyck
  2019-05-30 21:53 ` [RFC PATCH 03/11] mm: Add support for Treated Buddy pages Alexander Duyck
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:53 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

This patch is meant to move the head/tail adding logic out of the shuffle
code and into the __free_one_page function since ultimately that is where
it is really needed anyway. By doing this we should be able to reduce the
overhead and can consolidate all of the list addition bits in one spot.
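
To make the consolidated head/tail decision easier to follow, here is a
minimal userspace sketch of it. The helper names (take_top_bit,
pick_list_end) and the list_end enum are hypothetical; only the shape of
the decision and the shift-and-compare trick for consuming one random bit
mirror what this patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Consume the highest bit of *rand and report whether it was set.
 * Shifting left drops the top bit; the result can only be smaller than
 * the old value if that dropped bit was 1 (i.e. the shift "carried").
 */
static bool take_top_bit(uint64_t *rand)
{
	uint64_t old = *rand;

	*rand <<= 1;
	return *rand < old;
}

/* Hypothetical model of which end of the free list a page goes to. */
enum list_end { LIST_HEAD_END, LIST_TAIL_END };

static enum list_end pick_list_end(bool merge_likely, bool shuffle_order,
				   uint64_t *rand)
{
	/* A likely merge keeps the page out of the way at the tail. */
	if (merge_likely)
		return LIST_TAIL_END;

	/* Shuffled orders pick head or tail using one random bit. */
	if (shuffle_order && take_top_bit(rand))
		return LIST_TAIL_END;

	return LIST_HEAD_END;
}
```

Note that the sketch leaves out refilling the random state, which the real
code does with get_random_u64() once the available bits are used up.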

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h |   12 --------
 mm/page_alloc.c        |   70 +++++++++++++++++++++++++++---------------------
 mm/shuffle.c           |   24 ----------------
 mm/shuffle.h           |   35 ++++++++++++++++++++++++
 4 files changed, 74 insertions(+), 67 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a6bdff538437..297edb45071a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -108,18 +108,6 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
 	area->nr_free++;
 }
 
-#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
-/* Used to preserve page allocation order entropy */
-void add_to_free_area_random(struct page *page, struct free_area *area,
-		int migratetype);
-#else
-static inline void add_to_free_area_random(struct page *page,
-		struct free_area *area, int migratetype)
-{
-	add_to_free_area(page, area, migratetype);
-}
-#endif
-
 /* Used for pages which are on another list */
 static inline void move_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c061f66c2d0c..2fa5bbb372bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -851,6 +851,36 @@ static inline struct capture_control *task_capc(struct zone *zone)
 #endif /* CONFIG_COMPACTION */
 
 /*
+ * If this is not the largest possible page, check if the buddy
+ * of the next-highest order is free. If it is, it's possible
+ * that pages are being freed that will coalesce soon. In case,
+ * that is happening, add the free page to the tail of the list
+ * so it's less likely to be used soon and more likely to be merged
+ * as a higher order page
+ */
+static inline bool
+buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
+		   struct page *page, unsigned int order)
+{
+	struct page *higher_page, *higher_buddy;
+	unsigned long combined_pfn;
+
+	if (is_shuffle_order(order) || order >= (MAX_ORDER - 2))
+		return false;
+
+	if (!pfn_valid_within(buddy_pfn))
+		return false;
+
+	combined_pfn = buddy_pfn & pfn;
+	higher_page = page + (combined_pfn - pfn);
+	buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
+	higher_buddy = higher_page + (buddy_pfn - combined_pfn);
+
+	return pfn_valid_within(buddy_pfn) &&
+	       page_is_buddy(higher_page, higher_buddy, order + 1);
+}
+
+/*
  * Freeing function for a buddy system allocator.
  *
  * The concept of a buddy system is to maintain direct-mapped table
@@ -879,11 +909,12 @@ static inline void __free_one_page(struct page *page,
 		struct zone *zone, unsigned int order,
 		int migratetype)
 {
-	unsigned long combined_pfn;
+	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
-	struct page *buddy;
+	unsigned long combined_pfn;
+	struct free_area *area;
 	unsigned int max_order;
-	struct capture_control *capc = task_capc(zone);
+	struct page *buddy;
 
 	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
 
@@ -952,35 +983,12 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_page_order(page, order);
 
-	/*
-	 * If this is not the largest possible page, check if the buddy
-	 * of the next-highest order is free. If it is, it's possible
-	 * that pages are being freed that will coalesce soon. In case,
-	 * that is happening, add the free page to the tail of the list
-	 * so it's less likely to be used soon and more likely to be merged
-	 * as a higher order page
-	 */
-	if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
-			&& !is_shuffle_order(order)) {
-		struct page *higher_page, *higher_buddy;
-		combined_pfn = buddy_pfn & pfn;
-		higher_page = page + (combined_pfn - pfn);
-		buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
-		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
-		if (pfn_valid_within(buddy_pfn) &&
-		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			add_to_free_area_tail(page, &zone->free_area[order],
-					      migratetype);
-			return;
-		}
-	}
-
-	if (is_shuffle_order(order))
-		add_to_free_area_random(page, &zone->free_area[order],
-				migratetype);
+	area = &zone->free_area[order];
+	if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
+	    is_shuffle_tail_page(order))
+		add_to_free_area_tail(page, area, migratetype);
 	else
-		add_to_free_area(page, &zone->free_area[order], migratetype);
-
+		add_to_free_area(page, area, migratetype);
 }
 
 /*
diff --git a/mm/shuffle.c b/mm/shuffle.c
index 3ce12481b1dc..55d592e62526 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -4,7 +4,6 @@
 #include <linux/mm.h>
 #include <linux/init.h>
 #include <linux/mmzone.h>
-#include <linux/random.h>
 #include <linux/moduleparam.h>
 #include "internal.h"
 #include "shuffle.h"
@@ -182,26 +181,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
 		shuffle_zone(z);
 }
-
-void add_to_free_area_random(struct page *page, struct free_area *area,
-		int migratetype)
-{
-	static u64 rand;
-	static u8 rand_bits;
-
-	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
-	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
-	}
-
-	if (rand & 1)
-		add_to_free_area(page, area, migratetype);
-	else
-		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
-}
diff --git a/mm/shuffle.h b/mm/shuffle.h
index 777a257a0d2f..3f4edb60a453 100644
--- a/mm/shuffle.h
+++ b/mm/shuffle.h
@@ -3,6 +3,7 @@
 #ifndef _MM_SHUFFLE_H
 #define _MM_SHUFFLE_H
 #include <linux/jump_label.h>
+#include <linux/random.h>
 
 /*
  * SHUFFLE_ENABLE is called from the command line enabling path, or by
@@ -43,6 +44,35 @@ static inline bool is_shuffle_order(int order)
 		return false;
 	return order >= SHUFFLE_ORDER;
 }
+
+static inline bool is_shuffle_tail_page(int order)
+{
+	static u64 rand;
+	static u8 rand_bits;
+	u64 rand_old;
+
+	if (!is_shuffle_order(order))
+		return false;
+
+	/*
+	 * The lack of locking is deliberate. If 2 threads race to
+	 * update the rand state it just adds to the entropy.
+	 */
+	if (rand_bits-- == 0) {
+		rand_bits = 64;
+		rand = get_random_u64();
+	}
+
+	/*
+	 * Test highest order bit while shifting our random value. This
+	 * should result in us testing for the carry flag following the
+	 * shift.
+	 */
+	rand_old = rand;
+	rand <<= 1;
+
+	return rand < rand_old;
+}
 #else
 static inline void shuffle_free_memory(pg_data_t *pgdat)
 {
@@ -60,5 +90,10 @@ static inline bool is_shuffle_order(int order)
 {
 	return false;
 }
+
+static inline bool is_shuffle_tail_page(int order)
+{
+	return false;
+}
 #endif
 #endif /* _MM_SHUFFLE_H */



* [RFC PATCH 03/11] mm: Add support for Treated Buddy pages
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
  2019-05-30 21:53 ` [RFC PATCH 01/11] mm: Move MAX_ORDER definition closer to pageblock_order Alexander Duyck
  2019-05-30 21:53 ` [RFC PATCH 02/11] mm: Adjust shuffle code to allow for future coalescing Alexander Duyck
@ 2019-05-30 21:53 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 04/11] mm: Split nr_free into nr_free_raw and nr_free_treated Alexander Duyck
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:53 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

This patch is adding support for flagging pages as "Treated" within the
buddy allocator.

If memory aeration is not enabled then the value will always be treated as
false and the set/clear operations will have no effect.
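
For readers not familiar with the page_type convention, the following
userspace sketch models how a type bit is tested, set and reset. The
constant values are modelled on include/linux/page-flags.h from around
this kernel version and should be treated as assumptions, and page_model
and the page_type_*() helpers are hypothetical stand-ins rather than
kernel APIs. The point to notice is that a type counts as "set" when its
bit is clear, which is what allows Buddy and Treated to be combined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, modelled on include/linux/page-flags.h. */
#define PAGE_TYPE_BASE	0xf0000000u
#define PG_buddy	0x00000080u
#define PG_offline	0x00000100u	/* bit reused for "Treated" */

/* Hypothetical stand-in for the page_type field of struct page. */
struct page_model {
	uint32_t page_type;	/* 0xffffffff means "no type set" */
};

/* A type is set when its bit is CLEAR and the base bits are intact. */
static bool page_type_test(const struct page_model *p, uint32_t flag)
{
	return (p->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE;
}

/* Counterpart of __SetPage##uname: clear the bit. */
static void page_type_set(struct page_model *p, uint32_t flag)
{
	p->page_type &= ~flag;
}

/*
 * Counterpart of __ResetPage##uname: put the bit back. Unlike
 * __ClearPage##uname it does not insist that the type was set first,
 * so it is safe to call on a page that was never treated.
 */
static void page_type_reset(struct page_model *p, uint32_t flag)
{
	p->page_type |= flag;
}
```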

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h     |    1 +
 include/linux/page-flags.h |   32 ++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |    5 +++++
 3 files changed, 38 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 297edb45071a..0263d5bf0b84 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -127,6 +127,7 @@ static inline void del_page_from_free_area(struct page *page,
 {
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
+	__ResetPageTreated(page);
 	set_page_private(page, 0);
 	area->nr_free--;
 }
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..1f8ccb98dd69 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -722,12 +722,32 @@ static inline int page_has_type(struct page *page)
 	VM_BUG_ON_PAGE(!PageType(page, 0), page);			\
 	page->page_type &= ~PG_##lname;					\
 }									\
+static __always_inline void __ResetPage##uname(struct page *page)	\
+{									\
+	VM_BUG_ON_PAGE(!PageType(page, 0), page);			\
+	page->page_type |= PG_##lname;					\
+}									\
 static __always_inline void __ClearPage##uname(struct page *page)	\
 {									\
 	VM_BUG_ON_PAGE(!Page##uname(page), page);			\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_DISABLED(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)		\
+{									\
+}									\
+static __always_inline void __ResetPage##uname(struct page *page)	\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -744,6 +764,18 @@ static inline int page_has_type(struct page *page)
 PAGE_TYPE_OPS(Offline, offline)
 
 /*
+ * PageTreated() is an alias for Offline, however it is not meant to be an
+ * exclusive value. It should be combined with PageBuddy() when seen as it
+ * is meant to indicate that the page has been scrubbed while waiting in
+ * the buddy system.
+ */
+#ifdef CONFIG_AERATION
+PAGE_TYPE_OPS(Treated, offline)
+#else
+PAGE_TYPE_OPS_DISABLED(Treated)
+#endif
+
+/*
  * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
  * pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2fa5bbb372bb..2894990862bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -942,6 +942,11 @@ static inline void __free_one_page(struct page *page,
 			goto done_merging;
 		if (!page_is_buddy(page, buddy, order))
 			goto done_merging;
+
+		/* If buddy is not treated, then do not mark page treated */
+		if (!PageTreated(buddy))
+			__ResetPageTreated(page);
+
 		/*
 		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
 		 * merge with it and move up one order.



* [RFC PATCH 04/11] mm: Split nr_free into nr_free_raw and nr_free_treated
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (2 preceding siblings ...)
  2019-05-30 21:53 ` [RFC PATCH 03/11] mm: Add support for Treated Buddy pages Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 05/11] mm: Propogate Treated bit when splitting Alexander Duyck
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Split the nr_free value into two values that track where the pages were
inserted into the list. The idea is that we can use this later to track
which pages were treated and added to the free list versus the raw pages
which were just added to the head of the list.
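
The accounting this patch introduces can be modelled in userspace roughly
as follows. The struct and helper names are hypothetical; the counter
names and the rules (adds count as raw, deletes decrement whichever
counter matches the page, a treated page that is moved is re-counted as
raw) follow the diff below.

```c
#include <stdbool.h>

/* Hypothetical model of the two counters in struct free_area. */
struct free_area_model {
	unsigned long nr_free_raw;
	unsigned long nr_free_treated;
};

/* In this patch every newly added page is counted as raw. */
static void model_add_raw(struct free_area_model *a)
{
	a->nr_free_raw++;
}

/* Deleting a page decrements the counter matching its state. */
static void model_del(struct free_area_model *a, bool treated)
{
	if (treated)
		a->nr_free_treated--;
	else
		a->nr_free_raw--;
}

/* Moving a treated page to another list re-counts it as raw. */
static void model_move(struct free_area_model *a, bool treated)
{
	if (treated) {
		a->nr_free_treated--;
		a->nr_free_raw++;
	}
}

/* Equivalent of nr_pages_in_free_area(): the old nr_free value. */
static unsigned long model_total(const struct free_area_model *a)
{
	return a->nr_free_raw + a->nr_free_treated;
}
```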

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h |   36 ++++++++++++++++++++++++++++++++----
 mm/compaction.c        |    4 ++--
 mm/page_alloc.c        |   14 +++++++++-----
 mm/vmstat.c            |    5 +++--
 4 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0263d5bf0b84..988c3094b686 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -89,7 +89,8 @@ static inline bool is_migrate_movable(int mt)
 
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
-	unsigned long		nr_free;
+	unsigned long		nr_free_raw;
+	unsigned long		nr_free_treated;
 };
 
 /* Used for pages not on another list */
@@ -97,7 +98,7 @@ static inline void add_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
 {
 	list_add(&page->lru, &area->free_list[migratetype]);
-	area->nr_free++;
+	area->nr_free_raw++;
 }
 
 /* Used for pages not on another list */
@@ -105,13 +106,31 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
 				  int migratetype)
 {
 	list_add_tail(&page->lru, &area->free_list[migratetype]);
-	area->nr_free++;
+	area->nr_free_raw++;
 }
 
 /* Used for pages which are on another list */
 static inline void move_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
 {
+	/*
+	 * Since we are moving the page out of one migrate type and into
+	 * another the page will be added to the head of the new list.
+	 *
+	 * To avoid creating an island of raw pages floating between two
+	 * sections of treated pages we should reset the page type and
+	 * just re-treat the page when we process the destination.
+	 *
+	 * No need to trigger a notification for this since the page itself
+	 * is actually treated and we are just doing this for logistical
+	 * reasons.
+	 */
+	if (PageTreated(page)) {
+		__ResetPageTreated(page);
+		area->nr_free_treated--;
+		area->nr_free_raw++;
+	}
+
 	list_move(&page->lru, &area->free_list[migratetype]);
 }
 
@@ -125,11 +144,15 @@ static inline struct page *get_page_from_free_area(struct free_area *area,
 static inline void del_page_from_free_area(struct page *page,
 		struct free_area *area)
 {
+	if (PageTreated(page))
+		area->nr_free_treated--;
+	else
+		area->nr_free_raw--;
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	__ResetPageTreated(page);
 	set_page_private(page, 0);
-	area->nr_free--;
 }
 
 static inline bool free_area_empty(struct free_area *area, int migratetype)
@@ -137,6 +160,11 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 	return list_empty(&area->free_list[migratetype]);
 }
 
+static inline unsigned long nr_pages_in_free_area(struct free_area *area)
+{
+	return area->nr_free_raw + area->nr_free_treated;
+}
+
 struct pglist_data;
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 9febc8cc84e7..f5a27d5dccdf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1318,7 +1318,7 @@ static int next_search_order(struct compact_control *cc, int order)
 		unsigned long flags;
 		unsigned int order_scanned = 0;
 
-		if (!area->nr_free)
+		if (!nr_pages_in_free_area(area))
 			continue;
 
 		spin_lock_irqsave(&cc->zone->lock, flags);
@@ -1674,7 +1674,7 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
 		unsigned long flags;
 		struct page *freepage;
 
-		if (!area->nr_free)
+		if (!nr_pages_in_free_area(area))
 			continue;
 
 		spin_lock_irqsave(&cc->zone->lock, flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2894990862bd..10eaea762627 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2418,7 +2418,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 	int i;
 	int fallback_mt;
 
-	if (area->nr_free == 0)
+	if (!nr_pages_in_free_area(area))
 		return -1;
 
 	*can_steal = false;
@@ -3393,7 +3393,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		struct free_area *area = &z->free_area[o];
 		int mt;
 
-		if (!area->nr_free)
+		if (!nr_pages_in_free_area(area))
 			continue;
 
 		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
@@ -5325,7 +5325,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			struct free_area *area = &zone->free_area[order];
 			int type;
 
-			nr[order] = area->nr_free;
+			nr[order] = nr_pages_in_free_area(area);
 			total += nr[order] << order;
 
 			types[order] = 0;
@@ -5944,9 +5944,13 @@ void __ref memmap_init_zone_device(struct zone *zone,
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
 	unsigned int order, t;
-	for_each_migratetype_order(order, t) {
+
+	for_each_migratetype_order(order, t)
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
-		zone->free_area[order].nr_free = 0;
+
+	for (order = MAX_ORDER; order--; ) {
+		zone->free_area[order].nr_free_raw = 0;
+		zone->free_area[order].nr_free_treated = 0;
 	}
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fd7e16ca6996..aa822fda4250 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1031,7 +1031,7 @@ static void fill_contig_page_info(struct zone *zone,
 		unsigned long blocks;
 
 		/* Count number of free blocks */
-		blocks = zone->free_area[order].nr_free;
+		blocks = nr_pages_in_free_area(&zone->free_area[order]);
 		info->free_blocks_total += blocks;
 
 		/* Count free base pages */
@@ -1353,7 +1353,8 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 
 	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
 	for (order = 0; order < MAX_ORDER; ++order)
-		seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+		seq_printf(m, "%6lu ",
+			   nr_pages_in_free_area(&zone->free_area[order]));
 	seq_putc(m, '\n');
 }
 



* [RFC PATCH 05/11] mm: Propogate Treated bit when splitting
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (3 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 04/11] mm: Split nr_free into nr_free_raw and nr_free_treated Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 06/11] mm: Add membrane to free area to use as divider between treated and raw pages Alexander Duyck
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

When we are going to call "expand" to split a page into subpages we should
mark those subpages as being "Treated" if the parent page was a "Treated"
page. By doing this we can avoid potentially providing hints on a page that
was already hinted at a larger page size as being unused.
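
A simplified userspace model of that propagation is sketched below. It
only tracks the per-order counters, not the pages themselves, and the
names (area_counts, model_expand, MODEL_MAX_ORDER) are hypothetical. Guard
pages and the actual list handling done by expand() are left out.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11	/* same as the default MAX_ORDER */

/* Hypothetical per-order counters, one entry per free_area. */
struct area_counts {
	unsigned long nr_free_raw;
	unsigned long nr_free_treated;
};

/*
 * Split a block of order @high until a block of order @low remains.
 * Each split returns one half to the free area of the next lower order,
 * and that half inherits the treated state of the parent block.
 * The caller must pass low <= high < MODEL_MAX_ORDER.
 */
static void model_expand(struct area_counts *areas, int low, int high,
			 bool treated)
{
	while (high > low) {
		high--;
		if (treated)
			areas[high].nr_free_treated++;
		else
			areas[high].nr_free_raw++;
	}
}
```

Splitting a treated order-3 block to satisfy an order-0 request therefore
leaves one treated block each at orders 2, 1 and 0, none of which needs to
be hinted again.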

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h |    8 ++++++--
 mm/page_alloc.c        |   18 +++++++++++++++---
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 988c3094b686..a55fe6d2f63c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -97,16 +97,20 @@ struct free_area {
 static inline void add_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
 {
+	if (PageTreated(page))
+		area->nr_free_treated++;
+	else
+		area->nr_free_raw++;
+
 	list_add(&page->lru, &area->free_list[migratetype]);
-	area->nr_free_raw++;
 }
 
 /* Used for pages not on another list */
 static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
 				  int migratetype)
 {
-	list_add_tail(&page->lru, &area->free_list[migratetype]);
 	area->nr_free_raw++;
+	list_add_tail(&page->lru, &area->free_list[migratetype]);
 }
 
 /* Used for pages which are on another list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 10eaea762627..f6c067c6c784 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1965,7 +1965,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
  */
 static inline void expand(struct zone *zone, struct page *page,
 	int low, int high, struct free_area *area,
-	int migratetype)
+	int migratetype, bool treated)
 {
 	unsigned long size = 1 << high;
 
@@ -1984,8 +1984,17 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		add_to_free_area(&page[size], area, migratetype);
 		set_page_order(&page[size], high);
+		if (treated)
+			__SetPageTreated(&page[size]);
+
+		/*
+		 * The list we are placing this page in should be empty
+		 * so it should be safe to place it here without worrying
+		 * about creating a block of raw pages floating in between
+		 * two blocks of treated pages.
+		 */
+		add_to_free_area(&page[size], area, migratetype);
 	}
 }
 
@@ -2122,6 +2131,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	unsigned int current_order;
 	struct free_area *area;
 	struct page *page;
+	bool treated;
 
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
@@ -2129,8 +2139,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		page = get_page_from_free_area(area, migratetype);
 		if (!page)
 			continue;
+		treated = PageTreated(page);
 		del_page_from_free_area(page, area);
-		expand(zone, page, order, current_order, area, migratetype);
+		expand(zone, page, order, current_order, area, migratetype,
+		       treated);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
 	}



* [RFC PATCH 06/11] mm: Add membrane to free area to use as divider between treated and raw pages
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (4 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 05/11] mm: Propogate Treated bit when splitting Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 07/11] mm: Add support for acquiring first free "raw" or "untreated" page in zone Alexander Duyck
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Add a pointer we shall call "membrane" which represents the upper boundary
between the "raw" and "treated" pages. The general idea is that in order
for a page to cross from one side of the membrane to the other it will need
to go through the aeration treatment.

By doing this we should be able to make certain that we keep the treated
pages as one contiguous block within each free list. While pages are being
treated there may temporarily be two blocks of treated pages, but the two
should merge into one before we finish with the migratetype and allow it
to fall back into the "settling" state.
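
As a rough illustration of the bookkeeping described above, the following
userspace sketch models one free list with a membrane pointer. It is not the
kernel code: the toy_* names and the simplified node type are assumptions made
for this example, and raw pages are always added at the front of the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modeled on list_head. */
struct node {
	struct node *prev, *next;
	bool treated;		/* stands in for the PageTreated flag */
};

struct toy_area {
	struct node head;	/* list head, never holds a page */
	struct node *membrane;	/* first treated entry, or &head if none */
};

static void toy_area_init(struct toy_area *a)
{
	a->head.prev = a->head.next = &a->head;
	a->head.treated = false;
	a->membrane = &a->head;
}

/* Link @n between @prev and @next. */
static void link_between(struct node *n, struct node *prev, struct node *next)
{
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
}

/* Raw pages go to the front of the list, away from the treated block. */
static void toy_add_raw(struct toy_area *a, struct node *n)
{
	n->treated = false;
	link_between(n, &a->head, a->head.next);
}

/* Insert just above the membrane, then move the membrane onto the entry. */
static void toy_add_treated(struct toy_area *a, struct node *n)
{
	assert(n != &a->head);
	n->treated = true;
	link_between(n, a->membrane->prev, a->membrane);
	a->membrane = n;
}

/* Unlink @n; if it was the boundary, push the membrane back one entry. */
static void toy_del(struct toy_area *a, struct node *n)
{
	if (a->membrane == n)
		a->membrane = n->next;
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* True if everything before the membrane is raw and the rest is treated. */
static bool toy_is_partitioned(const struct toy_area *a)
{
	const struct node *n;
	bool below = false;

	for (n = a->head.next; n != &a->head; n = n->next) {
		if (n == a->membrane)
			below = true;
		if (n->treated != below)
			return false;
	}
	return true;
}
```

In this model the treated entries always form a single block at the tail of
the list, and removing the entry the membrane points at simply moves the
membrane to the next entry, which matches the intent of the hunks below.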

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |   14 ++++++++++++--
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a55fe6d2f63c..be996e8ca6b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -87,10 +87,28 @@ static inline bool is_migrate_movable(int mt)
 	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
 			PB_migrate_end, MIGRATETYPE_MASK)
 
+/*
+ * The treatment state indicates the current state of the region pointed to
+ * by the treatment_mt and the membrane pointer. The general idea is that
+ * when we are in the "SETTLING" state the treatment area is contiguous and
+ * it is safe to move on to treating another migratetype. If we are in the
+ * "AERATING" state then the region is being actively processed and we
+ * would cause issues such as potentially isolating a section of raw pages
+ * between two sections of treated pages if we were to move onto another
+ * migratetype.
+ */
+enum treatment_state {
+	TREATMENT_SETTLING,
+	TREATMENT_AERATING,
+};
+
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free_raw;
 	unsigned long		nr_free_treated;
+	struct list_head	*membrane;
+	u8			treatment_mt;
+	u8			treatment_state;
 };
 
 /* Used for pages not on another list */
@@ -113,6 +131,19 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
 	list_add_tail(&page->lru, &area->free_list[migratetype]);
 }
 
+static inline void
+add_to_free_area_treated(struct page *page, struct free_area *area,
+			 int migratetype)
+{
+	area->nr_free_treated++;
+
+	BUG_ON(area->treatment_mt != migratetype);
+
+	/* Insert page above membrane, then move membrane to the page */
+	list_add_tail(&page->lru, area->membrane);
+	area->membrane = &page->lru;
+}
+
 /* Used for pages which are on another list */
 static inline void move_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
@@ -135,6 +166,10 @@ static inline void move_to_free_area(struct page *page, struct free_area *area,
 		area->nr_free_raw++;
 	}
 
+	/* push membrane back if we removed the upper boundary */
+	if (area->membrane == &page->lru)
+		area->membrane = page->lru.next;
+
 	list_move(&page->lru, &area->free_list[migratetype]);
 }
 
@@ -153,6 +188,9 @@ static inline void del_page_from_free_area(struct page *page,
 	else
 		area->nr_free_raw--;
 
+	if (area->membrane == &page->lru)
+		area->membrane = page->lru.next;
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	__ResetPageTreated(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f6c067c6c784..f4a629b6af96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -989,6 +989,11 @@ static inline void __free_one_page(struct page *page,
 	set_page_order(page, order);
 
 	area = &zone->free_area[order];
+	if (PageTreated(page)) {
+		add_to_free_area_treated(page, area, migratetype);
+		return;
+	}
+
 	if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
 	    is_shuffle_tail_page(order))
 		add_to_free_area_tail(page, area, migratetype);
@@ -5961,8 +5966,13 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 
 	for (order = MAX_ORDER; order--; ) {
-		zone->free_area[order].nr_free_raw = 0;
-		zone->free_area[order].nr_free_treated = 0;
+		struct free_area *area = &zone->free_area[order];
+
+		area->nr_free_raw = 0;
+		area->nr_free_treated = 0;
+		area->treatment_mt = 0;
+		area->treatment_state = TREATMENT_SETTLING;
+		area->membrane = &area->free_list[0];
 	}
 }
 



* [RFC PATCH 07/11] mm: Add support for acquiring first free "raw" or "untreated" page in zone
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (5 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 06/11] mm: Add membrane to free area to use as divider between treated and raw pages Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 08/11] mm: Add support for creating memory aeration Alexander Duyck
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

In order to be able to "treat" memory in an asynchronous fashion we need a
way to acquire a block of memory that isn't already treated, and then flush
that back in a way that we will not pick it back up again.

To achieve that, this patch adds a pair of functions: one to fill a list
with pages to be treated, and another that will flush the list back out to
the buddy allocator.
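
The fill/flush pairing can be pictured with a small userspace model. This is
only a sketch under simplifying assumptions: the pool is a flat array rather
than the buddy free lists, and toy_page, toy_fill and toy_flush are names
invented for the example, not functions from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool free;		/* sitting in the toy free pool */
	bool treated;		/* already hinted; must not be pulled again */
	unsigned int order;	/* recorded while the page is out for treatment */
};

/* Pull up to @max free, untreated pages out of @pool into @batch. */
static size_t toy_fill(struct toy_page *pool, size_t n,
		       struct toy_page **batch, size_t max, unsigned int order)
{
	size_t i, got = 0;

	assert(pool && batch);

	for (i = 0; i < n && got < max; i++) {
		if (!pool[i].free || pool[i].treated)
			continue;
		pool[i].free = false;
		pool[i].order = order;	/* remember where it came from */
		batch[got++] = &pool[i];
	}
	return got;
}

/* Return a batch to the pool, marking each page so fill skips it next time. */
static void toy_flush(struct toy_page **batch, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		batch[i]->treated = true;
		batch[i]->free = true;
	}
}
```

The point of the model is the invariant from the commit message: a page that
has been flushed back carries the treated mark, so a later fill will not pick
it up again.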

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/gfp.h |    6 +++
 mm/page_alloc.c     |  107 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..407a089d861f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -559,6 +559,12 @@ extern void *page_frag_alloc(struct page_frag_cache *nc,
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
+#ifdef CONFIG_AERATION
+struct page *get_raw_pages(struct zone *zone, unsigned int order,
+			   int migratetype);
+void free_treated_page(struct page *page);
+#endif
+
 void page_alloc_init_late(void);
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f4a629b6af96..e79c65413dc9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2155,6 +2155,113 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	return NULL;
 }
 
+#ifdef CONFIG_AERATION
+static struct page *get_raw_page_from_free_area(struct free_area *area,
+						int migratetype)
+{
+	struct list_head *head = &area->free_list[migratetype];
+	struct page *page;
+
+	/* If we have not worked in this free_list before reset membrane */
+	if (area->treatment_mt != migratetype) {
+		area->treatment_mt = migratetype;
+		area->membrane = head;
+	}
+
+	/* Try pulling in any untreated pages above the membrane */
+	page = list_last_entry(area->membrane, struct page, lru);
+	list_for_each_entry_from_reverse(page, head, lru) {
+		/*
+		 * If the page in front of the membrane is treated then try
+		 * skimming the top to see if we have any untreated pages
+		 * up there.
+		 */
+		if (PageTreated(page)) {
+			page = list_first_entry(head, struct page, lru);
+			if (PageTreated(page))
+				break;
+		}
+
+		/* update state of treatment */
+		area->treatment_state = TREATMENT_AERATING;
+
+		return page;
+	}
+
+	/*
+	 * At this point there are no longer any untreated pages between
+	 * the membrane and the first entry of the list. So we can safely
+	 * set the membrane to the top of the treated region and will mark
+	 * the current migratetype as complete for now.
+	 */
+	area->membrane = &page->lru;
+	area->treatment_state = TREATMENT_SETTLING;
+
+	return NULL;
+}
+
+/**
+ * get_raw_pages - Provide a "raw" page for treatment by the aerator
+ * @zone: Zone to draw pages from
+ * @order: Order to draw pages from
+ * @migratetype: Migratetype to draw pages from
+ *
+ * This function will obtain a page that does not have the Treated value
+ * set in the page type field. It will attempt to fetch a "raw" page from
+ * just above the "membrane" and if that is not available it will attempt
+ * to pull a "raw" page from the head of the free list.
+ *
+ * The page will have the migrate type and order stored in the page
+ * metadata.
+ *
+ * Return: page pointer if raw page found, otherwise NULL
+ */
+struct page *get_raw_pages(struct zone *zone, unsigned int order,
+			   int migratetype)
+{
+	struct free_area *area = &(zone->free_area[order]);
+	struct page *page;
+
+	/* Find a page of the appropriate size in the preferred list */
+	page = get_raw_page_from_free_area(area, migratetype);
+	if (page) {
+		del_page_from_free_area(page, area);
+
+		/* record migratetype and order within page */
+		set_pcppage_migratetype(page, migratetype);
+		set_page_private(page, order);
+		__mod_zone_freepage_state(zone, -(1 << order), migratetype);
+	}
+
+	return page;
+}
+EXPORT_SYMBOL_GPL(get_raw_pages);
+
+/**
+ * free_treated_page - Return a now-treated "raw" page back where we got it
+ * @page: Previously "raw" page that can now be returned after treatment
+ *
+ * This function will pull the zone, migratetype, and order information out
+ * of the page and attempt to return it where it found it. We default to
+ * using free_one_page to return the page as it is possible that the
+ * pageblock might have been switched to an isolate migratetype during
+ * treatment.
+ */
+void free_treated_page(struct page *page)
+{
+	unsigned int order, mt;
+	struct zone *zone;
+
+	zone = page_zone(page);
+	mt = get_pcppage_migratetype(page);
+	order = page_private(page);
+
+	set_page_private(page, 0);
+
+	free_one_page(zone, page, page_to_pfn(page), order, mt);
+}
+EXPORT_SYMBOL_GPL(free_treated_page);
+#endif /* CONFIG_AERATION */
 
 /*
  * This array describes the order lists are fallen back to when



* [RFC PATCH 08/11] mm: Add support for creating memory aeration
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (6 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 07/11] mm: Add support for acquiring first free "raw" or "untreated" page in zone Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 09/11] mm: Count isolated pages as "treated" Alexander Duyck
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Add support for "aerating" memory in a guest by pushing individual pages
out. This patch is meant to add generic support for this by adding a common
framework that can be used later by drivers such as virtio-balloon.
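
To make the driver contract a little more concrete, here is a userspace
sketch of the fill, react, decant loop. It is deliberately simplified and is
an assumption-laden model, not the framework itself: pages are reduced to
counters, there is no locking or reference counting, and the sbr_* names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sbr_model {
	size_t capacity;			/* pages per batch */
	size_t in_reactor;			/* pages currently being hinted */
	size_t raw;				/* untreated pages still waiting */
	size_t treated;				/* pages already hinted */
	size_t reacts;				/* how many batches were sent */
	void (*react)(struct sbr_model *m);	/* driver hook: send hints */
};

/* Fill: move up to capacity raw pages into the reactor. Returns the count. */
static size_t sbr_fill(struct sbr_model *m)
{
	size_t n = m->raw < m->capacity ? m->raw : m->capacity;

	m->raw -= n;
	m->in_reactor = n;
	return n;
}

/* Decant: the host is done with the batch, so return it as treated. */
static void sbr_decant(struct sbr_model *m)
{
	m->treated += m->in_reactor;
	m->in_reactor = 0;
}

/*
 * One cycle: drain the previous batch, refill, and hand off to the driver.
 * Returns 1 if a new batch was started, 0 if the model went idle.
 */
static int sbr_cycle(struct sbr_model *m)
{
	assert(m->react != NULL);

	sbr_decant(m);
	if (!sbr_fill(m))
		return 0;
	m->react(m);
	return 1;
}

/* Example driver hook: just count how often we were asked to send hints. */
static void sbr_count_react(struct sbr_model *m)
{
	m->reacts++;
}
```

A driver would call the cycle function each time the host acknowledges a
batch; with a capacity of 4 and 10 raw pages the model sends three batches
and then goes idle.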

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/memory_aeration.h |   54 +++++++
 mm/Kconfig                      |    5 +
 mm/Makefile                     |    1 
 mm/aeration.c                   |  320 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 380 insertions(+)
 create mode 100644 include/linux/memory_aeration.h
 create mode 100644 mm/aeration.c

diff --git a/include/linux/memory_aeration.h b/include/linux/memory_aeration.h
new file mode 100644
index 000000000000..5ba0e634f240
--- /dev/null
+++ b/include/linux/memory_aeration.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_AERATION_H
+#define _LINUX_MEMORY_AERATION_H
+
+#include <linux/pageblock-flags.h>
+#include <linux/jump_label.h>
+#include <asm/pgtable_types.h>
+
+struct zone;
+
+#define AERATOR_MIN_ORDER	pageblock_order
+
+struct aerator_dev_info {
+	unsigned long capacity;
+	struct list_head batch_reactor;
+	atomic_t refcnt;
+	void (*react)(struct aerator_dev_info *a_dev_info);
+};
+
+extern struct static_key aerator_notify_enabled;
+
+void aerator_cycle(void);
+void __aerator_notify(struct zone *zone, int order);
+
+/**
+ * aerator_notify_free - Free page notification that will start page processing
+ * @page: Last page processed
+ * @zone: Pointer to current zone of last page processed
+ * @order: Order of last page added to zone
+ *
+ * This function is meant to act as a screener for __aerator_notify which
+ * will determine if a given zone has crossed over the high-water mark that
+ * will justify us beginning page treatment. If we have crossed that
+ * threshold then it will start the process of pulling some pages and
+ * placing them in the batch_reactor list for treatment.
+ */
+static inline void
+aerator_notify_free(struct page *page, struct zone *zone, int order)
+{
+	if (!static_key_false(&aerator_notify_enabled))
+		return;
+
+	if (order < AERATOR_MIN_ORDER)
+		return;
+
+	__aerator_notify(zone, order);
+}
+
+void aerator_shutdown(void);
+int aerator_startup(struct aerator_dev_info *sdev);
+
+#define AERATOR_ZONE_BITS	(BITS_TO_LONGS(MAX_NR_ZONES) * BITS_PER_LONG)
+#define AERATOR_HWM_BITS	(AERATOR_ZONE_BITS * MAX_NUMNODES)
+#endif /*_LINUX_MEMORY_AERATION_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index f0c76ba47695..34680214cefa 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -236,6 +236,11 @@ config COMPACTION
           linux-mm@kvack.org.
 
 #
+# support for memory aeration
+config AERATION
+	bool
+
+#
 # support for page migration
 #
 config MIGRATION
diff --git a/mm/Makefile b/mm/Makefile
index ac5e5ba78874..26c2fcd2b89d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_AERATION) += aeration.o
diff --git a/mm/aeration.c b/mm/aeration.c
new file mode 100644
index 000000000000..aaf8af8d822f
--- /dev/null
+++ b/mm/aeration.c
@@ -0,0 +1,320 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/memory_aeration.h>
+#include <linux/mmzone.h>
+#include <linux/gfp.h>
+#include <linux/export.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+
+static unsigned long *aerator_hwm;
+static struct aerator_dev_info *a_dev_info;
+struct static_key aerator_notify_enabled;
+
+void aerator_shutdown(void)
+{
+	static_key_slow_dec(&aerator_notify_enabled);
+
+	while (atomic_read(&a_dev_info->refcnt))
+		msleep(20);
+
+	kfree(aerator_hwm);
+	aerator_hwm = NULL;
+
+	a_dev_info = NULL;
+}
+EXPORT_SYMBOL_GPL(aerator_shutdown);
+
+int aerator_startup(struct aerator_dev_info *sdev)
+{
+	size_t size = BITS_TO_LONGS(AERATOR_HWM_BITS) * sizeof(unsigned long);
+	unsigned long *hwm;
+
+	if (a_dev_info || aerator_hwm)
+		return -EBUSY;
+
+	a_dev_info = sdev;
+
+	atomic_set(&sdev->refcnt, 0);
+
+	hwm = kzalloc(size, GFP_KERNEL);
+	if (!hwm) {
+		a_dev_info = NULL;
+		return -ENOMEM;
+	}
+
+	aerator_hwm = hwm;
+
+	static_key_slow_inc(&aerator_notify_enabled);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(aerator_startup);
+
+static inline unsigned long *get_aerator_hwm(int nid)
+{
+	if (!aerator_hwm)
+		return NULL;
+
+	return aerator_hwm + (BITS_TO_LONGS(MAX_NR_ZONES) * nid);
+}
+
+static int __aerator_fill(struct zone *zone, unsigned int size)
+{
+	struct list_head *batch = &a_dev_info->batch_reactor;
+	unsigned long nr_raw = 0;
+	unsigned int len = 0;
+	unsigned int order;
+
+	for (order = MAX_ORDER; order-- != AERATOR_MIN_ORDER;) {
+		struct free_area *area = &(zone->free_area[order]);
+		int mt = area->treatment_mt;
+
+		/*
+		 * If there are no untreated pages to pull
+		 * then we might as well skip the area.
+		 */
+		while (area->nr_free_raw) {
+			unsigned int count = 0;
+			struct page *page;
+
+			/*
+			 * If we completed aeration we can let the current
+			 * free list work on settling so that a batch of
+			 * new raw pages can build. In the meantime move on
+			 * to the next migratetype.
+			 */
+			if (++mt >= MIGRATE_TYPES)
+				mt = 0;
+
+			/*
+			 * Pull pages from free list until we have drained
+			 * it or we have filled the batch reactor.
+			 */
+			while ((page = get_raw_pages(zone, order, mt))) {
+				list_add(&page->lru, batch);
+
+				if (++count == (size - len))
+					return size;
+			}
+
+			/*
+			 * If we pulled any pages from this migratetype then
+			 * we must move on to a new free area as we cannot
+			 * move the membrane until after we have decanted the
+			 * pages currently being aerated.
+			 */
+			if (count) {
+				len += count;
+				break;
+			}
+		}
+
+		/*
+		 * Keep a running total of the raw pages we have left
+		 * behind. We will use this to determine if we should
+		 * clear the HWM flag.
+		 */
+		nr_raw += area->nr_free_raw;
+	}
+
+	/*
+	 * If there are no longer enough free pages to fully populate
+	 * the aerator, then we can just shut it down for this zone.
+	 */
+	if (nr_raw < a_dev_info->capacity) {
+		unsigned long *hwm = get_aerator_hwm(zone_to_nid(zone));
+
+		clear_bit(zone_idx(zone), hwm);
+		atomic_dec(&a_dev_info->refcnt);
+	}
+
+	return len;
+}
+
+static unsigned int aerator_fill(int nid, int zid, int budget)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	struct zone *zone = &pgdat->node_zones[zid];
+	unsigned long flags;
+	int len;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	/* fill aerator with "raw" pages */
+	len = __aerator_fill(zone, budget);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return len;
+}
+
+static void aerator_fill_and_react(void)
+{
+	int budget = a_dev_info->capacity;
+	int nr;
+
+	/*
+	 * We should never be calling this function while there are already
+	 * pages in the reactor being aerated. If we are called under such
+	 * a circumstance report an error.
+	 */
+	BUG_ON(!list_empty(&a_dev_info->batch_reactor));
+retry:
+	/*
+	 * We want to hold one additional reference against the number of
+	 * active hints as we may clear the hint that originally brought us
+	 * here. We will clear it after we have either vaporized the content
+	 * of the pages, or if we discover all pages were stolen out from
+	 * under us.
+	 */
+	atomic_inc(&a_dev_info->refcnt);
+
+	for_each_set_bit(nr, aerator_hwm, AERATOR_HWM_BITS) {
+		int node_id = nr / AERATOR_ZONE_BITS;
+		int zone_id = nr % AERATOR_ZONE_BITS;
+
+		budget -= aerator_fill(node_id, zone_id, budget);
+		if (!budget)
+			goto start_aerating;
+	}
+
+	if (unlikely(list_empty(&a_dev_info->batch_reactor))) {
+		/*
+		 * If we never generated any pages, and we were holding the
+		 * only remaining reference to active hints then we can
+		 * just let this go for now and go idle.
+		 */
+		if (atomic_dec_and_test(&a_dev_info->refcnt))
+			return;
+
+		/*
+		 * There must be a bit populated somewhere, try going
+		 * back through and finding it.
+		 */
+		goto retry;
+	}
+
+start_aerating:
+	a_dev_info->react(a_dev_info);
+}
+
+void aerator_decant(void)
+{
+	struct list_head *list = &a_dev_info->batch_reactor;
+	struct page *page;
+
+	/*
+	 * This function should never be called on an empty list. If so it
+	 * points to a bug as we should never be running the aerator when
+	 * the list is empty.
+	 */
+	WARN_ON(list_empty(&a_dev_info->batch_reactor));
+
+	while ((page = list_first_entry_or_null(list, struct page, lru))) {
+		list_del(&page->lru);
+
+		__SetPageTreated(page);
+
+		free_treated_page(page);
+	}
+}
+
+/**
+ * aerator_cycle - drain, fill, and start aerating another batch of pages
+ *
+ * This function is at the heart of the aerator. It should be called after
+ * the previous batch of pages has finished being processed by the aerator.
+ * It will drain the aerator, refill it, and start the next set of pages
+ * being processed.
+ */
+void aerator_cycle(void)
+{
+	aerator_decant();
+
+	/*
+	 * Now that the pages have been flushed we can drop our reference to
+	 * the active hints list. If there are no further hints that need to
+	 * be processed we can simply go idle.
+	 */
+	if (atomic_dec_and_test(&a_dev_info->refcnt))
+		return;
+
+	aerator_fill_and_react();
+}
+EXPORT_SYMBOL_GPL(aerator_cycle);
+
+static void __aerator_fill_and_react(struct zone *zone)
+{
+	/*
+	 * We should never be calling this function while there are already
+	 * pages in the list being aerated. If we are called under such a
+	 * circumstance report an error.
+	 */
+	BUG_ON(!list_empty(&a_dev_info->batch_reactor));
+
+	/*
+	 * We want to hold one additional reference against the number of
+	 * active hints as we may clear the hint that originally brought us
+	 * here. We will clear it after we have either vaporized the content
+	 * of the pages, or if we discover all pages were stolen out from
+	 * under us.
+	 */
+	atomic_inc(&a_dev_info->refcnt);
+
+	__aerator_fill(zone, a_dev_info->capacity);
+
+	if (unlikely(list_empty(&a_dev_info->batch_reactor))) {
+		/*
+		 * If we never generated any pages, and we were holding the
+		 * only remaining reference to active hints then we can just
+		 * let this go for now and go idle.
+		 */
+		if (atomic_dec_and_test(&a_dev_info->refcnt))
+			return;
+
+		/*
+		 * Another zone must have populated some raw pages that
+		 * need to be processed. Release the zone lock and process
+		 * that zone instead.
+		 */
+		spin_unlock(&zone->lock);
+		aerator_fill_and_react();
+	} else {
+		/* Release the zone lock and begin the page aerator */
+		spin_unlock(&zone->lock);
+		a_dev_info->react(a_dev_info);
+	}
+
+	/* Reacquire lock so we can resume processing this zone */
+	spin_lock(&zone->lock);
+}
+
+void __aerator_notify(struct zone *zone, int order)
+{
+	int node_id = zone_to_nid(zone);
+	int zone_id = zone_idx(zone);
+	unsigned long *hwm;
+
+	if (zone->free_area[order].nr_free_raw < (2 * a_dev_info->capacity))
+		return;
+
+	hwm = get_aerator_hwm(node_id);
+
+	/*
+	 * We can use separate test and set operations here as there
+	 * is nothing else that can set or clear this bit while we are
+	 * holding the zone lock. The advantage to doing it this way is
+	 * that we don't have to dirty the cacheline unless we are
+	 * changing the value.
+	 */
+	if (test_bit(zone_id, hwm))
+		return;
+	set_bit(zone_id, hwm);
+
+	if (atomic_fetch_inc(&a_dev_info->refcnt))
+		return;
+
+	__aerator_fill_and_react(zone);
+}
+EXPORT_SYMBOL_GPL(__aerator_notify);
+



* [RFC PATCH 09/11] mm: Count isolated pages as "treated"
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (7 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 08/11] mm: Add support for creating memory aeration Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 10/11] virtio-balloon: Add support for aerating memory via bubble hinting Alexander Duyck
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Treat isolated pages as though they have already been treated. We do this
so that we can avoid trying to treat pages that have been marked for
isolation. We don't want to treat a page and then find, when we put it
back, that it has been moved into the isolated migratetype, nor do we want
to pull pages out of the isolated migratetype and then find that they now
belong to a different migratetype.

To avoid those issues we can specifically mark all isolated pages as being
"treated" and avoid special case handling for them since they will never be
merged anyway, so we can just add them to the head of the free_list.

In addition we will skip over the isolate migratetype when getting raw
pages.
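
The migratetype rotation that skips the isolate type can be sketched in
isolation as follows. The enum below is a stand-in chosen for this example:
the real set of migratetypes and the position of MIGRATE_ISOLATE depend on
the kernel configuration.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's migratetype enum. */
enum toy_migratetype {
	TOY_UNMOVABLE,
	TOY_MOVABLE,
	TOY_RECLAIMABLE,
	TOY_ISOLATE,	/* assumed position; varies with kernel config */
	TOY_TYPES
};

/* Advance to the next migratetype, wrapping around and skipping isolate. */
static int toy_next_mt(int mt)
{
	assert(mt >= 0 && mt < TOY_TYPES);

	do {
		if (++mt >= TOY_TYPES)
			mt = 0;
	} while (mt == TOY_ISOLATE);

	return mt;
}
```

Because the check sits in the loop condition, the rotation never stops on
the isolate type no matter where it starts.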

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mmzone.h |    7 +++++++
 mm/aeration.c          |    8 ++++++--
 mm/page_alloc.c        |    2 +-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index be996e8ca6b5..f749ccfcc62a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -137,6 +137,13 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
 {
 	area->nr_free_treated++;
 
+#ifdef CONFIG_MEMORY_ISOLATION
+	/* Bypass membrane for isolated pages, all are considered "treated" */
+	if (migratetype == MIGRATE_ISOLATE) {
+		list_add(&page->lru, &area->free_list[migratetype]);
+		return;
+	}
+#endif
 	BUG_ON(area->treatment_mt != migratetype);
 
 	/* Insert page above membrane, then move membrane to the page */
diff --git a/mm/aeration.c b/mm/aeration.c
index aaf8af8d822f..f921295ed3ae 100644
--- a/mm/aeration.c
+++ b/mm/aeration.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/memory_aeration.h>
+#include <linux/mm.h>
 #include <linux/mmzone.h>
+#include <linux/page-isolation.h>
 #include <linux/gfp.h>
 #include <linux/export.h>
 #include <linux/delay.h>
@@ -83,8 +85,10 @@ static int __aerator_fill(struct zone *zone, unsigned int size)
 			 * new raw pages can build. In the meantime move on
 			 * to the next migratetype.
 			 */
-			if (++mt >= MIGRATE_TYPES)
-				mt = 0;
+			do {
+				if (++mt >= MIGRATE_TYPES)
+					mt = 0;
+			} while (is_migrate_isolate(mt));
 
 			/*
 			 * Pull pages from free list until we have drained
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e79c65413dc9..e3800221414b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -989,7 +989,7 @@ static inline void __free_one_page(struct page *page,
 	set_page_order(page, order);
 
 	area = &zone->free_area[order];
-	if (PageTreated(page)) {
+	if (is_migrate_isolate(migratetype) || PageTreated(page)) {
 		add_to_free_area_treated(page, area, migratetype);
 		return;
 	}



* [RFC PATCH 10/11] virtio-balloon: Add support for aerating memory via bubble hinting
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (8 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 09/11] mm: Count isolated pages as "treated" Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:54 ` [RFC PATCH 11/11] mm: Add free page notification hook Alexander Duyck
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Add support for aerating memory using the bubble hinting feature provided
by virtio-balloon. Bubble hinting differs from the regular balloon
functionality in that it is much less durable than a standard memory
balloon. Instead of creating a list of pages that cannot be accessed, the
pages are only inaccessible while they are being indicated to the virtio
interface. Once the interface has acknowledged them they are placed back
into their respective free lists and are once again accessible by the guest
system.
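
For readers who want to see what a single hint carries, the sketch below
computes the two fields for one free block. It is a userspace approximation
with assumptions called out: toy_hint and toy_make_hint are made-up names,
the byte-order conversion done by cpu_to_virtio32() is left out, and the
guest page shift is passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BALLOON_PFN_SHIFT 12	/* balloon protocol counts in 4K units */

struct toy_hint {
	uint32_t pfn;	/* first 4K frame of the region */
	uint32_t size;	/* length of the region in 4K frames */
};

/*
 * Build a hint for a block of 2^order guest pages starting at @guest_pfn,
 * where each guest page is 2^page_shift bytes.
 */
static struct toy_hint toy_make_hint(uint64_t guest_pfn, unsigned int order,
				     unsigned int page_shift)
{
	unsigned int per_page;
	struct toy_hint h;

	assert(page_shift >= TOY_BALLOON_PFN_SHIFT);

	/* Number of 4K balloon frames in one guest page. */
	per_page = 1u << (page_shift - TOY_BALLOON_PFN_SHIFT);

	h.pfn = (uint32_t)(guest_pfn * per_page);
	h.size = per_page << order;
	return h;
}
```

With 4K guest pages an order-9 block is reported as 512 frames; with 64K
guest pages every guest page counts as 16 balloon frames.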

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 drivers/virtio/Kconfig              |    1 
 drivers/virtio/virtio_balloon.c     |   89 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_balloon.h |    1 
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 023fc3bc01c6..9cdaccf92c3a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -47,6 +47,7 @@ config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
 	select MEMORY_BALLOON
+	select AERATION
 	---help---
 	 This driver supports increasing and decreasing the amount
 	 of memory within a KVM guest.
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 44339fc87cc7..e1399991bc1f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -18,6 +18,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/memory_aeration.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,7 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_DEFLATE,
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_HINTING,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -52,9 +54,16 @@ enum virtio_balloon_config_read {
 	VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
 };
 
+#define VIRTIO_BUBBLE_ARRAY_HINTS_MAX	32
+struct virtio_bubble_page_hint {
+	__virtio32 pfn;
+	__virtio32 size;
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
+								*hinting_vq;
 
 	/* Balloon's own wq for cpu-intensive work items */
 	struct workqueue_struct *balloon_wq;
@@ -107,6 +116,11 @@ struct virtio_balloon {
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
 
+	/* The array of PFNs we are hinting on */
+	unsigned int num_hints;
+	struct virtio_bubble_page_hint hints[VIRTIO_BUBBLE_ARRAY_HINTS_MAX];
+	struct aerator_dev_info a_dev_info;
+
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
@@ -151,6 +165,54 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 }
 
+void virtballoon_aerator_react(struct aerator_dev_info *a_dev_info)
+{
+	struct virtio_balloon *vb = container_of(a_dev_info,
+						struct virtio_balloon,
+						a_dev_info);
+	struct virtqueue *vq = vb->hinting_vq;
+	struct scatterlist sg;
+	unsigned int unused;
+	struct page *page;
+
+	vb->num_hints = 0;
+
+	list_for_each_entry(page, &a_dev_info->batch_reactor, lru) {
+		struct virtio_bubble_page_hint *hint;
+		unsigned int size;
+
+		hint = &vb->hints[vb->num_hints++];
+		hint->pfn = cpu_to_virtio32(vb->vdev,
+					    page_to_balloon_pfn(page));
+		size = VIRTIO_BALLOON_PAGES_PER_PAGE << page_private(page);
+		hint->size = cpu_to_virtio32(vb->vdev, size);
+	}
+
+	/* We shouldn't have been called if there is nothing to process */
+	if (WARN_ON(vb->num_hints == 0))
+		return;
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &unused))
+		;
+
+	sg_init_one(&sg, vb->hints,
+		    sizeof(vb->hints[0]) * vb->num_hints);
+
+	/*
+	 * We should always be able to add one buffer to an
+	 * empty queue.
+	 */
+	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	virtqueue_kick(vq);
+}
+
+static void aerator_settled(struct virtqueue *vq)
+{
+	/* Drain the current aerator contents, refill, and start next cycle */
+	aerator_cycle();
+}
+
 static void set_page_pfns(struct virtio_balloon *vb,
 			  __virtio32 pfns[], struct page *page)
 {
@@ -475,6 +537,7 @@ static int init_vqs(struct virtio_balloon *vb)
 	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
 	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -486,11 +549,19 @@ static int init_vqs(struct virtio_balloon *vb)
 		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+		names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
+		callbacks[VIRTIO_BALLOON_VQ_HINTING] = aerator_settled;
+	}
+
 	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 					 vqs, callbacks, names, NULL, NULL);
 	if (err)
 		return err;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+		vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
+
 	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
 	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -929,12 +1000,25 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		if (err)
 			goto out_del_balloon_wq;
 	}
+
+	vb->a_dev_info.react = virtballoon_aerator_react;
+	vb->a_dev_info.capacity = VIRTIO_BUBBLE_ARRAY_HINTS_MAX;
+	INIT_LIST_HEAD(&vb->a_dev_info.batch_reactor);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+		err = aerator_startup(&vb->a_dev_info);
+		if (err)
+			goto out_unregister_shrinker;
+	}
+
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
 		virtballoon_changed(vdev);
 	return 0;
 
+out_unregister_shrinker:
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+		virtio_balloon_unregister_shrinker(vb);
 out_del_balloon_wq:
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		destroy_workqueue(vb->balloon_wq);
@@ -963,6 +1047,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb = vdev->priv;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+		aerator_shutdown();
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		virtio_balloon_unregister_shrinker(vb);
 	spin_lock_irq(&vb->stop_update_lock);
@@ -1032,6 +1118,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
+	VIRTIO_BALLOON_F_HINTING,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..2b0f62814e22 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12



* [RFC PATCH 11/11] mm: Add free page notification hook
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (9 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 10/11] virtio-balloon: Add support for aerating memory via bubble hinting Alexander Duyck
@ 2019-05-30 21:54 ` Alexander Duyck
  2019-05-30 21:57 ` [RFC QEMU PATCH] QEMU: Provide a interface for hinting based off of the balloon infrastructure Alexander Duyck
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:54 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Add a hook so that we are notified when a new page is available. We will
use this hook to notify the virtio aeration system when we have achieved
enough free higher-order pages to justify the process of pulling some pages
and hinting on them.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 arch/x86/include/asm/page.h |   11 +++++++++++
 include/linux/gfp.h         |    4 ++++
 mm/page_alloc.c             |    2 ++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..dfd546230120 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -18,6 +18,17 @@
 
 struct page;
 
+#ifdef CONFIG_AERATION
+#include <linux/memory_aeration.h>
+
+#define HAVE_ARCH_FREE_PAGE_NOTIFY
+static inline void
+arch_free_page_notify(struct page *page, struct zone *zone, int order)
+{
+	aerator_notify_free(page, zone, order);
+}
+
+#endif
 #include <linux/range.h>
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 407a089d861f..d975e7eabbf8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 #ifndef HAVE_ARCH_FREE_PAGE
 static inline void arch_free_page(struct page *page, int order) { }
 #endif
+#ifndef HAVE_ARCH_FREE_PAGE_NOTIFY
+static inline void
+arch_free_page_notify(struct page *page, struct zone *zone, int order) { }
+#endif
 #ifndef HAVE_ARCH_ALLOC_PAGE
 static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3800221414b..104763034ce3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -999,6 +999,8 @@ static inline void __free_one_page(struct page *page,
 		add_to_free_area_tail(page, area, migratetype);
 	else
 		add_to_free_area(page, area, migratetype);
+
+	arch_free_page_notify(page, zone, order);
 }
 
 /*



* [RFC QEMU PATCH] QEMU: Provide a interface for hinting based off of the balloon infrastructure
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (10 preceding siblings ...)
  2019-05-30 21:54 ` [RFC PATCH 11/11] mm: Add free page notification hook Alexander Duyck
@ 2019-05-30 21:57 ` Alexander Duyck
  2019-05-30 22:52 ` [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Michael S. Tsirkin
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-30 21:57 UTC (permalink / raw)
  To: nitesh, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

This is meant to be a simplification of the existing balloon interface,
used to provide hints about which memory can be freed. I am assuming this
is safe to do, as the deflate logic does not actually appear to do much
other than track which subpages have been released and which ones have
not.

I suspect this is still a bit crude and will need some more work.
Suggestions welcome.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 hw/virtio/trace-events                          |    1 
 hw/virtio/virtio-balloon.c                      |   85 +++++++++++++++++++++++
 include/hw/virtio/virtio-balloon.h              |    2 -
 include/standard-headers/linux/virtio_balloon.h |    1 
 4 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index e28ba48da621..b56daf460769 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
 virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
 virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
 virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
+virtio_bubble_handle_output(const char *name, uint64_t gpa, uint64_t size) "section name: %s gpa: 0x%" PRIx64 " size: %" PRIx64
 
 # virtio-mmio.c
 virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 2112874055fb..eb819ec8f436 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -34,6 +34,13 @@
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
+struct guest_pages {
+    unsigned long pfn;
+    unsigned int order;
+};
+
+void page_hinting_request(uint64_t addr, uint32_t len);
+
 struct PartiallyBalloonedPage {
     RAMBlock *rb;
     ram_addr_t base;
@@ -328,6 +335,80 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
     balloon_stats_change_timer(s, 0);
 }
 
+static void bubble_inflate_page(VirtIOBalloon *balloon,
+                                MemoryRegion *mr, hwaddr offset, size_t size)
+{
+    void *addr = memory_region_get_ram_ptr(mr) + offset;
+    ram_addr_t ram_offset;
+    size_t rb_page_size;
+    RAMBlock *rb;
+
+    rb = qemu_ram_block_from_host(addr, false, &ram_offset);
+    rb_page_size = qemu_ram_pagesize(rb);
+
+    /* For now we will simply ignore unaligned memory regions */
+    if ((ram_offset | size) & (rb_page_size - 1))
+        return;
+
+    ram_block_discard_range(rb, ram_offset, size);
+}
+
+static void virtio_bubble_handle_output(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
+    VirtQueueElement *elem;
+    MemoryRegionSection section;
+
+    for (;;) {
+        size_t offset = 0;
+        struct {
+            uint32_t pfn;
+            uint32_t size;
+        } hint;
+
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            return;
+        }
+
+        while (iov_to_buf(elem->out_sg, elem->out_num, offset, &hint, 8) == 8) {
+            size_t size = virtio_ldl_p(vdev, &hint.size);
+            hwaddr pa = virtio_ldl_p(vdev, &hint.pfn);
+
+            offset += 8;
+
+            if (qemu_balloon_is_inhibited())
+                continue;
+
+            pa <<= VIRTIO_BALLOON_PFN_SHIFT;
+            size <<= VIRTIO_BALLOON_PFN_SHIFT;
+
+            section = memory_region_find(get_system_memory(), pa, size);
+            if (!section.mr) {
+                trace_virtio_balloon_bad_addr(pa);
+                continue;
+            }
+
+            if (!memory_region_is_ram(section.mr) ||
+                memory_region_is_rom(section.mr) ||
+                memory_region_is_romd(section.mr)) {
+                trace_virtio_balloon_bad_addr(pa);
+            } else {
+                trace_virtio_bubble_handle_output(memory_region_name(section.mr),
+                                                  pa, size);
+                bubble_inflate_page(s, section.mr,
+                                    section.offset_within_region, size);
+            }
+
+            memory_region_unref(section.mr);
+        }
+
+        virtqueue_push(vq, elem, offset);
+        virtio_notify(vdev, vq);
+        g_free(elem);
+    }
+}
+
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -694,6 +775,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
     VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
     f |= dev->host_features;
     virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+    virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);
 
     return f;
 }
@@ -780,6 +862,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+    s->hvq = virtio_add_queue(vdev, 128, virtio_bubble_handle_output);
 
     if (virtio_has_feature(s->host_features,
                            VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
@@ -875,6 +958,8 @@ static void virtio_balloon_instance_init(Object *obj)
 
     object_property_add(obj, "guest-stats", "guest statistics",
                         balloon_stats_get_all, NULL, NULL, s, NULL);
+    object_property_add(obj, "guest-page-hinting", "guest page hinting",
+                        NULL, NULL, NULL, s, NULL);
 
     object_property_add(obj, "guest-stats-polling-interval", "int",
                         balloon_stats_get_poll_interval,
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 1afafb12f6bc..dd6d4d0e45fd 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {
 
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
-    VirtQueue *ivq, *dvq, *svq, *free_page_vq;
+    VirtQueue *ivq, *dvq, *svq, *hvq, *free_page_vq;
     uint32_t free_page_report_status;
     uint32_t num_pages;
     uint32_t actual;
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 9375ca2a70de..f9e3e8256261 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_HINTING	5 /* Page hinting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12



* Re: [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (11 preceding siblings ...)
  2019-05-30 21:57 ` [RFC QEMU PATCH] QEMU: Provide a interface for hinting based off of the balloon infrastructure Alexander Duyck
@ 2019-05-30 22:52 ` Michael S. Tsirkin
  2019-05-31 11:16 ` Nitesh Narayan Lal
  2019-06-03  9:31 ` David Hildenbrand
  14 siblings, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-05-30 22:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: nitesh, kvm, david, dave.hansen, linux-kernel, linux-mm,
	yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

On Thu, May 30, 2019 at 02:53:34PM -0700, Alexander Duyck wrote:
> This series provides an asynchronous means of hinting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as "waste page treatment".
> 
> I have based many of the terms and functionality off of waste water
> treatment, the idea for the similarity occurred to me after I had reached
> the point of referring to the hints as "bubbles", as the hints used the
> same approach as the balloon functionality but would disappear if they
> were touched, as a result I started to think of the virtio device as an
> aerator. The general idea with all of this is that the guest should be
> treating the unused pages so that when they end up heading "downstream"
> to either another guest, or back at the host they will not need to be
> written to swap.

A lovely analogy.

> So for a bit of background for the treatment process, it is based on a
> sequencing batch reactor (SBR)[1]. The treatment process itself has five
> stages. The first stage is the fill, with this we take the raw pages and
> add them to the reactor. The second stage is react, in this stage we hand
> the pages off to the Virtio Balloon driver to have hints attached to them
> and for those hints to be sent to the hypervisor. The third stage is
> settle, in this stage we are waiting for the hypervisor to process the
> pages, and we should receive an interrupt when it is completed. The fourth
> stage is to decant, or drain the reactor of pages. Finally we have the
> idle stage which we will go into if the reference count for the reactor
> gets down to 0 after a drain, or if a fill operation fails to obtain any
> pages and the reference count has hit 0. Otherwise we return to the first
> state and start the cycle over again.

will review the patchset closely shortly.

> This patch set is still far more intrusive than I would really like for
> what it has to do. Currently I am splitting the nr_free_pages into two
> values and having to add a pointer and an index to track where we are in
> the treatment process for a given free_area. I'm also not sure I have
> covered all possible corner cases where pages can get into the free_area
> or move from one migratetype to another.
> 
> Also I am still leaving a number of things hard-coded such as limiting the
> lowest order processed to PAGEBLOCK_ORDER, and have left it up to the
> guest to determine what size of reactor it wants to allocate to process
> the hints.
> 
> Another consideration I am still debating is if I really want to process
> the aerator_cycle() function in interrupt context or if I should have it
> running in a thread somewhere else.
> 
> [1]: https://en.wikipedia.org/wiki/Sequencing_batch_reactor
> 
> ---
> 
> Alexander Duyck (11):
>       mm: Move MAX_ORDER definition closer to pageblock_order
>       mm: Adjust shuffle code to allow for future coalescing
>       mm: Add support for Treated Buddy pages
>       mm: Split nr_free into nr_free_raw and nr_free_treated
>       mm: Propogate Treated bit when splitting
>       mm: Add membrane to free area to use as divider between treated and raw pages
>       mm: Add support for acquiring first free "raw" or "untreated" page in zone
>       mm: Add support for creating memory aeration
>       mm: Count isolated pages as "treated"
>       virtio-balloon: Add support for aerating memory via bubble hinting
>       mm: Add free page notification hook
> 
> 
>  arch/x86/include/asm/page.h         |   11 +
>  drivers/virtio/Kconfig              |    1 
>  drivers/virtio/virtio_balloon.c     |   89 ++++++++++
>  include/linux/gfp.h                 |   10 +
>  include/linux/memory_aeration.h     |   54 ++++++
>  include/linux/mmzone.h              |  100 +++++++++--
>  include/linux/page-flags.h          |   32 +++
>  include/linux/pageblock-flags.h     |    8 +
>  include/uapi/linux/virtio_balloon.h |    1 
>  mm/Kconfig                          |    5 +
>  mm/Makefile                         |    1 
>  mm/aeration.c                       |  324 +++++++++++++++++++++++++++++++++++
>  mm/compaction.c                     |    4 
>  mm/page_alloc.c                     |  220 ++++++++++++++++++++----
>  mm/shuffle.c                        |   24 ---
>  mm/shuffle.h                        |   35 ++++
>  mm/vmstat.c                         |    5 -
>  17 files changed, 838 insertions(+), 86 deletions(-)
>  create mode 100644 include/linux/memory_aeration.h
>  create mode 100644 mm/aeration.c
> 
> --


* Re: [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (12 preceding siblings ...)
  2019-05-30 22:52 ` [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Michael S. Tsirkin
@ 2019-05-31 11:16 ` Nitesh Narayan Lal
  2019-05-31 15:51   ` Alexander Duyck
  2019-06-03  9:31 ` David Hildenbrand
  14 siblings, 1 reply; 18+ messages in thread
From: Nitesh Narayan Lal @ 2019-05-31 11:16 UTC (permalink / raw)
  To: Alexander Duyck, kvm, david, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck


On 5/30/19 5:53 PM, Alexander Duyck wrote:
> This series provides an asynchronous means of hinting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as "waste page treatment".
>
> I have based many of the terms and functionality off of waste water
> treatment, the idea for the similarity occurred to me after I had reached
> the point of referring to the hints as "bubbles", as the hints used the
> same approach as the balloon functionality but would disappear if they
> were touched, as a result I started to think of the virtio device as an
> aerator. The general idea with all of this is that the guest should be
> treating the unused pages so that when they end up heading "downstream"
> to either another guest, or back at the host they will not need to be
> written to swap.
>
> So for a bit of background for the treatment process, it is based on a
> sequencing batch reactor (SBR)[1]. The treatment process itself has five
> stages. The first stage is the fill, with this we take the raw pages and
> add them to the reactor. The second stage is react, in this stage we hand
> the pages off to the Virtio Balloon driver to have hints attached to them
> and for those hints to be sent to the hypervisor. The third stage is
> settle, in this stage we are waiting for the hypervisor to process the
> pages, and we should receive an interrupt when it is completed. The fourth
> stage is to decant, or drain the reactor of pages. Finally we have the
> idle stage which we will go into if the reference count for the reactor
> gets down to 0 after a drain, or if a fill operation fails to obtain any
> pages and the reference count has hit 0. Otherwise we return to the first
> state and start the cycle over again.
>
> This patch set is still far more intrusive than I would really like for
> what it has to do. Currently I am splitting the nr_free_pages into two
> values and having to add a pointer and an index to track where we are in
> the treatment process for a given free_area. I'm also not sure I have
> covered all possible corner cases where pages can get into the free_area
> or move from one migratetype to another.
>
> Also I am still leaving a number of things hard-coded such as limiting the
> lowest order processed to PAGEBLOCK_ORDER, and have left it up to the
> guest to determine what size of reactor it wants to allocate to process
> the hints.
>
> Another consideration I am still debating is if I really want to process
> the aerator_cycle() function in interrupt context or if I should have it
> running in a thread somewhere else.

Can you please share some performance numbers?

I will be sharing a less mm-intrusive bitmap-based approach hopefully by
next week.
Let's compare the two approaches then. In the meantime I will start
reviewing your patch-set.

>
> [1]: https://en.wikipedia.org/wiki/Sequencing_batch_reactor
>
> ---
>
> Alexander Duyck (11):
>       mm: Move MAX_ORDER definition closer to pageblock_order
>       mm: Adjust shuffle code to allow for future coalescing
>       mm: Add support for Treated Buddy pages
>       mm: Split nr_free into nr_free_raw and nr_free_treated
>       mm: Propogate Treated bit when splitting
>       mm: Add membrane to free area to use as divider between treated and raw pages
>       mm: Add support for acquiring first free "raw" or "untreated" page in zone
>       mm: Add support for creating memory aeration
>       mm: Count isolated pages as "treated"
>       virtio-balloon: Add support for aerating memory via bubble hinting
>       mm: Add free page notification hook
>
>
>  arch/x86/include/asm/page.h         |   11 +
>  drivers/virtio/Kconfig              |    1 
>  drivers/virtio/virtio_balloon.c     |   89 ++++++++++
>  include/linux/gfp.h                 |   10 +
>  include/linux/memory_aeration.h     |   54 ++++++
>  include/linux/mmzone.h              |  100 +++++++++--
>  include/linux/page-flags.h          |   32 +++
>  include/linux/pageblock-flags.h     |    8 +
>  include/uapi/linux/virtio_balloon.h |    1 
>  mm/Kconfig                          |    5 +
>  mm/Makefile                         |    1 
>  mm/aeration.c                       |  324 +++++++++++++++++++++++++++++++++++
>  mm/compaction.c                     |    4 
>  mm/page_alloc.c                     |  220 ++++++++++++++++++++----
>  mm/shuffle.c                        |   24 ---
>  mm/shuffle.h                        |   35 ++++
>  mm/vmstat.c                         |    5 -
>  17 files changed, 838 insertions(+), 86 deletions(-)
>  create mode 100644 include/linux/memory_aeration.h
>  create mode 100644 mm/aeration.c
>
> --
-- 
Regards
Nitesh


* Re: [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment
  2019-05-31 11:16 ` Nitesh Narayan Lal
@ 2019-05-31 15:51   ` Alexander Duyck
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-05-31 15:51 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm list, David Hildenbrand, Michael S. Tsirkin, Dave Hansen,
	LKML, linux-mm, Yang Zhang, pagupta, Rik van Riel,
	Konrad Rzeszutek Wilk, lcapitulino, wei.w.wang, Andrea Arcangeli,
	Paolo Bonzini, dan.j.williams, Alexander Duyck

On Fri, May 31, 2019 at 4:16 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 5/30/19 5:53 PM, Alexander Duyck wrote:
> > This series provides an asynchronous means of hinting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this I have implemented functionality that allows
> > for what I am referring to as "waste page treatment".
> >
> > I have based many of the terms and functionality off of waste water
> > treatment, the idea for the similarity occurred to me after I had reached
> > the point of referring to the hints as "bubbles", as the hints used the
> > same approach as the balloon functionality but would disappear if they
> > were touched, as a result I started to think of the virtio device as an
> > aerator. The general idea with all of this is that the guest should be
> > treating the unused pages so that when they end up heading "downstream"
> > to either another guest, or back at the host they will not need to be
> > written to swap.
> >
> > So for a bit of background for the treatment process, it is based on a
> > sequencing batch reactor (SBR)[1]. The treatment process itself has five
> > stages. The first stage is the fill, with this we take the raw pages and
> > add them to the reactor. The second stage is react, in this stage we hand
> > the pages off to the Virtio Balloon driver to have hints attached to them
> > and for those hints to be sent to the hypervisor. The third stage is
> > settle, in this stage we are waiting for the hypervisor to process the
> > pages, and we should receive an interrupt when it is completed. The fourth
> > stage is to decant, or drain the reactor of pages. Finally we have the
> > idle stage which we will go into if the reference count for the reactor
> > gets down to 0 after a drain, or if a fill operation fails to obtain any
> > pages and the reference count has hit 0. Otherwise we return to the first
> > state and start the cycle over again.
> >
> > This patch set is still far more intrusive than I would really like for
> > what it has to do. Currently I am splitting the nr_free_pages into two
> > values and having to add a pointer and an index to track where we are in
> > the treatment process for a given free_area. I'm also not sure I have
> > covered all possible corner cases where pages can get into the free_area
> > or move from one migratetype to another.
> >
> > Also I am still leaving a number of things hard-coded such as limiting the
> > lowest order processed to PAGEBLOCK_ORDER, and have left it up to the
> > guest to determine what size of reactor it wants to allocate to process
> > the hints.
> >
> > Another consideration I am still debating is if I really want to process
> > the aerator_cycle() function in interrupt context or if I should have it
> > running in a thread somewhere else.
>
> Can you please share some performance numbers?
>
> I will be sharing a less mm-intrusive bitmap-based approach hopefully by
> next week.
> Let's compare the two approaches then. In the meantime I will start
> reviewing your patch-set.

The performance can vary quite a bit depending on the configuration.
For example, with memory shuffling enabled I saw an overall improvement
in transactions in the page_fault1 test I was running. However, I suspect
that is just because I inlined the bit that was doing the shuffling in
the 2nd patch.

I'm still working on gathering data so you can consider the data
provided below as preliminary, and I want to emphasize your mileage
may vary as it seems like the config used can make a big difference.

So the results below are for a will-it-scale test of a VM running with
16 VCPUs and 32G of memory. The clean version is without patches
applied, and "aerate" is with patches applied. I disabled the memory
shuffling in the config for the kernels since it seemed like an unfair
comparison with it enabled. Before the test I ran "memhog 32g" to pull
in all available memory on the "clean" test and to pull in and flush
all the memory on the "aerate" tests. One thing that isn't really
making sense to me yet is why the results for the aerate version
appear to be better than the clean version when we start getting into
higher thread counts. One thing I notice is that clear_page_erms jumps
to the top of a perf trace on the host at about the inflection point
where the "clean" guest starts to under-perform versus the "aerate"
guest. So it is possible that there may be some benefit to having the
host clear the pages before the guest processes them.

5.2.0-rc2-next-20190529-clean #53
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,574916,93.73,574313,93.70,574916
2,1006964,87.47,918228,87.52,1149832
3,1373857,81.23,1170468,82.35,1724748
4,1781250,74.98,1526831,76.77,2299664
5,1973790,68.74,1764815,69.86,2874580
6,2235127,62.53,1912371,65.42,3449496
7,2499192,56.28,1936901,61.63,4024412
8,2581220,50.05,2114032,56.54,4599328
9,2804630,43.81,2202166,52.37,5174244
10,2746340,37.58,2194305,48.31,5749160
11,2694687,31.33,2189028,41.74,6324076
12,2772102,25.16,2176312,40.85,6898992
13,2854235,18.94,2146288,37.61,7473908
14,2720456,12.73,2063334,32.67,8048824
15,2753005,6.51,2103228,26.65,8623740
16,2595824,0.36,2142308,25.96,9198656
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,568948,93.73,570092,93.72,570092
2,1006781,87.47,911829,87.57,1140184
3,1360418,81.23,1189920,82.22,1710276
4,1749889,74.99,1476555,77.22,2280368
5,1927251,68.76,1681522,70.49,2850460
6,2221112,62.51,1845148,65.74,3420552
7,2497960,56.29,1983406,61.44,3990644
8,2586250,50.01,2062633,56.99,4560736
9,2570559,43.82,1989225,53.14,5130828
10,2692389,37.57,2159570,48.07,5700920
11,2621505,31.33,2214469,43.73,6271012
12,2772863,25.15,2164639,40.35,6841104
13,2839319,18.94,2184126,36.90,7411196
14,2712433,12.77,2048788,31.16,7981288
15,2779543,6.54,2105144,27.29,8551380
16,2605799,0.34,2101187,23.20,9121472

5.2.0-rc2-next-20190529-aerate+ #55
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,538715,93.73,538909,93.73,538909
2,985096,87.46,899393,87.54,1077818
3,1421187,81.25,1271836,81.88,1616727
4,1745358,75.00,1435337,77.61,2155636
5,2031097,68.76,1766946,70.37,2694545
6,2234646,62.51,1794401,66.94,3233454
7,2455541,56.27,2101020,59.42,3772363
8,2576793,50.09,1810192,59.45,4311272
9,2772082,43.82,2315719,50.58,4850181
10,2794868,37.62,1996644,50.84,5389090
11,2931943,31.36,2147434,42.58,5927999
12,2837655,25.12,2032434,42.79,6466908
13,2881797,18.95,2163387,36.80,7005817
14,2802190,12.73,2049732,30.00,7544726
15,2684374,6.53,2039098,26.48,8083635
16,2695848,0.41,2044131,22.08,8622544
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,533361,93.72,532973,93.73,533361
2,980085,87.46,904796,87.50,1066722
3,1387100,81.21,1271080,81.41,1600083
4,1720030,74.99,1539417,75.99,2133444
5,1942111,68.74,1530612,73.21,2666805
6,2226552,62.51,1777038,66.97,3200166
7,2469715,56.27,2084451,59.72,3733527
8,2567333,50.04,1820491,59.38,4266888
9,2744551,43.82,2259861,51.73,4800249
10,2768107,37.60,2240844,48.10,5333610
11,2879636,31.37,2134152,46.56,5866971
12,2826859,25.18,1960830,44.07,6400332
13,2905216,19.05,1887735,36.66,6933693
14,2841688,12.79,2047092,30.18,7467054
15,2832234,6.57,2066059,25.31,8000415
16,2758579,0.38,1961050,22.16,8533776
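
For anyone who wants to compare the runs above, here is a minimal sketch
that parses two of the CSV blocks and reports the relative change in the
"processes" column per task count. The helper names (parse_run,
percent_change) are hypothetical and are not part of the patch set or of
the benchmark tool. Only a few rows are pasted in to keep it short, and a
single pair of runs says nothing about run-to-run noise.

```python
import csv
import io


def parse_run(text):
    """Parse one CSV block (header plus rows) into {tasks: processes}."""
    rows = csv.DictReader(io.StringIO(text.strip()))
    return {int(r["tasks"]): int(r["processes"]) for r in rows}


def percent_change(baseline, candidate):
    """Relative change of candidate vs. baseline, in percent, per task count.

    Rows with a zero baseline (the tasks=0 row) are skipped to avoid
    dividing by zero.
    """
    out = {}
    for tasks, base in baseline.items():
        if base and tasks in candidate:
            out[tasks] = 100.0 * (candidate[tasks] - base) / base
    return out


# A few rows copied from the first "clean" run and the first "aerate+" run.
clean = parse_run("""
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,574916,93.73,574313,93.70,574916
16,2595824,0.36,2142308,25.96,9198656
""")
aerate = parse_run("""
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,538715,93.73,538909,93.73,538909
16,2695848,0.41,2044131,22.08,8622544
""")

delta = percent_change(clean, aerate)
for tasks in sorted(delta):
    print(f"{tasks:2d} tasks: {delta[tasks]:+.2f}%")
```

Pasting the full blocks into parse_run() gives the complete per-task
comparison; averaging the two runs of each kernel first would be a
reasonable refinement.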


* Re: [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment
  2019-05-30 21:53 [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment Alexander Duyck
                   ` (13 preceding siblings ...)
  2019-05-31 11:16 ` Nitesh Narayan Lal
@ 2019-06-03  9:31 ` David Hildenbrand
  2019-06-03 15:33   ` Alexander Duyck
  14 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2019-06-03  9:31 UTC (permalink / raw)
  To: Alexander Duyck, nitesh, kvm, mst, dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams,
	alexander.h.duyck

On 30.05.19 23:53, Alexander Duyck wrote:
> This series provides an asynchronous means of hinting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as "waste page treatment".
> 
> I have based many of the terms and functionality off of waste water
> treatment, the idea for the similarity occurred to me after I had reached
> the point of referring to the hints as "bubbles", as the hints used the
> same approach as the balloon functionality but would disappear if they
> were touched, as a result I started to think of the virtio device as an
> aerator. The general idea with all of this is that the guest should be
> treating the unused pages so that when they end up heading "downstream"
> to either another guest, or back at the host they will not need to be
> written to swap.
> 
> So for a bit of background for the treatment process, it is based on a
> sequencing batch reactor (SBR)[1]. The treatment process itself has five
> stages. The first stage is the fill, with this we take the raw pages and
> add them to the reactor. The second stage is react, in this stage we hand
> the pages off to the Virtio Balloon driver to have hints attached to them
> and for those hints to be sent to the hypervisor. The third stage is
> settle, in this stage we are waiting for the hypervisor to process the
> pages, and we should receive an interrupt when it is completed. The fourth
> stage is to decant, or drain the reactor of pages. Finally we have the
> idle stage which we will go into if the reference count for the reactor
> gets down to 0 after a drain, or if a fill operation fails to obtain any
> pages and the reference count has hit 0. Otherwise we return to the first
> state and start the cycle over again.

While I like this analogy, I don't like the terminology mixed into
linux-mm core.

mm/aeration.c? Bubble? Treated? whut?

Can you come up with a terminology one can understand without a PhD in
biology? (if that is even the right field of study, I have no idea)


ALSO: isn't the analogy partially wrong? Nobody would be using "waste
water" just because they are low on "clean water". At least not in my
city (I hope so ;) ). But maybe I am not getting the whole concept
because we are dealing with pages we want to hint to the hypervisor and
not with actual "waste".

> 
> This patch set is still far more intrusive than I would really like for
> what it has to do. Currently I am splitting the nr_free_pages into two
> values and having to add a pointer and an index to track where we are in
> the treatment process for a given free_area. I'm also not sure I have
> covered all possible corner cases where pages can get into the free_area
> or move from one migratetype to another.

Yes, it is quite intrusive. Maybe we can minimize the impact/error
proneness.

> 
> Also I am still leaving a number of things hard-coded such as limiting the
> lowest order processed to PAGEBLOCK_ORDER, and have left it up to the
> guest to determine what size of reactor it wants to allocate to process
> the hints.
> 
> Another consideration I am still debating is if I really want to process
> the aerator_cycle() function in interrupt context or if I should have it
> running in a thread somewhere else.

Did you get to test/benchmark the difference?

-- 

Thanks,

David / dhildenb


* Re: [RFC PATCH 00/11] mm / virtio: Provide support for paravirtual waste page treatment
  2019-06-03  9:31 ` David Hildenbrand
@ 2019-06-03 15:33   ` Alexander Duyck
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Duyck @ 2019-06-03 15:33 UTC (permalink / raw)
  To: David Hildenbrand, Alexander Duyck, nitesh, kvm, mst,
	dave.hansen, linux-kernel, linux-mm
  Cc: yang.zhang.wz, pagupta, riel, konrad.wilk, lcapitulino,
	wei.w.wang, aarcange, pbonzini, dan.j.williams

On Mon, 2019-06-03 at 11:31 +0200, David Hildenbrand wrote:
> On 30.05.19 23:53, Alexander Duyck wrote:
> > This series provides an asynchronous means of hinting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this I have implemented functionality that allows
> > for what I am referring to as "waste page treatment".
> > 
> > I have based many of the terms and functionality off of waste water
> > treatment, the idea for the similarity occurred to me after I had reached
> > the point of referring to the hints as "bubbles", as the hints used the
> > same approach as the balloon functionality but would disappear if they
> > were touched, as a result I started to think of the virtio device as an
> > aerator. The general idea with all of this is that the guest should be
> > treating the unused pages so that when they end up heading "downstream"
> > to either another guest, or back at the host they will not need to be
> > written to swap.
> > 
> > So for a bit of background for the treatment process, it is based on a
> > sequencing batch reactor (SBR)[1]. The treatment process itself has five
> > stages. The first stage is the fill, with this we take the raw pages and
> > add them to the reactor. The second stage is react, in this stage we hand
> > the pages off to the Virtio Balloon driver to have hints attached to them
> > and for those hints to be sent to the hypervisor. The third stage is
> > settle, in this stage we are waiting for the hypervisor to process the
> > pages, and we should receive an interrupt when it is completed. The fourth
> > stage is to decant, or drain the reactor of pages. Finally we have the
> > idle stage which we will go into if the reference count for the reactor
> > gets down to 0 after a drain, or if a fill operation fails to obtain any
> > pages and the reference count has hit 0. Otherwise we return to the first
> > state and start the cycle over again.
> 
> While I like this analogy, I don't like the terminology mixed into
> linux-mm core.
> 
> mm/aeration.c? Bubble? Treated? whut?
> 
> Can you come up with a terminology one can understand without a PhD in
> biology? (if that is even the right field of study, I have no idea)

I had started with the bubble, as I had mentioned before. From there I got
to aerator because of the fact that we were filling the memory with holes.
I figure the first two work pretty well, but I am not really attached to
any of the other terms. As far as the rest of the terminology most of it
is actually chemistry if I am not mistaken. I could probably just swap out
"Treated" with "Aerated" and it would work just as well. I would also need
to get away from the more complex terms such as "decant", but for the most
part that is just a matter of finding the synonyms such as "drain".

> ALSO: isn't the analogy partially wrong? Nobody would be using "waste
> water" just because they are low on "clean water". At least not in my
> city (I hope so ;) ). But maybe I am not getting the whole concept
> because we are dealing with pages we want to hint to the hypervisor and
> not with actual "waste".

Actually the analogy isn't for a low condition. The analogy would be for a
condition where we have an excess of waste water and don't want to contain
it. As such we want to treat it and return it to the water cycle.

As far as the "waste" in the analogy I was thinking more of the page data.
When a page has been used we normally mark it as "Dirty", so I thought it
would be an apt analogy since those dirty pages would have to be written
to long term storage if we didn't do something to invalidate the page
data.

> > This patch set is still far more intrusive than I would really like for
> > what it has to do. Currently I am splitting the nr_free_pages into two
> > values and having to add a pointer and an index to track where we are in
> > the treatment process for a given free_area. I'm also not sure I have
> > covered all possible corner cases where pages can get into the free_area
> > or move from one migratetype to another.
> 
> Yes, it is quite intrusive. Maybe we can minimize the impact/error
> proneness.

My hope in submitting this as an RFC was to get input on what directions I
might need to head in before I went too far down this current path.

> > Also I am still leaving a number of things hard-coded such as limiting the
> > lowest order processed to PAGEBLOCK_ORDER, and have left it up to the
> > guest to determine what size of reactor it wants to allocate to process
> > the hints.
> > 
> > Another consideration I am still debating is if I really want to process
> > the aerator_cycle() function in interrupt context or if I should have it
> > running in a thread somewhere else.
> 
> Did you get to test/benchmark the difference?

I haven't yet.



