* [PATCH v8 0/5] make balloon pages movable by compaction
@ 2012-08-21 12:47 Rafael Aquini
  2012-08-21 12:47 ` [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility Rafael Aquini
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 12:47 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, virtualization, Rusty Russell, Michael S. Tsirkin,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim, Rafael Aquini

Memory fragmentation introduced by ballooning might significantly reduce
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch-set follows the main idea discussed at the 2012 LSFMMS session
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/ --
and introduces the required changes to the virtio_balloon driver, as well as
to the core compaction & migration bits, in order to make those subsystems
aware of ballooned pages and to allow memory balloon pages to become movable
within a guest, thus avoiding the aforementioned fragmentation issue.

Rafael Aquini (5):
  mm: introduce a common interface for balloon pages mobility
  mm: introduce compaction and migration for ballooned pages
  virtio_balloon: introduce migration primitives to balloon pages
  mm: introduce putback_movable_pages()
  mm: add vm event counters for balloon pages compaction

 drivers/virtio/virtio_balloon.c    | 212 +++++++++++++++++++++++++++++++++++--
 include/linux/balloon_compaction.h | 113 ++++++++++++++++++++
 include/linux/migrate.h            |   2 +
 include/linux/pagemap.h            |  18 ++++
 include/linux/vm_event_item.h      |   8 +-
 mm/Kconfig                         |  15 +++
 mm/Makefile                        |   2 +-
 mm/balloon_compaction.c            | 147 +++++++++++++++++++++++++
 mm/compaction.c                    |  51 +++++----
 mm/migrate.c                       |  52 ++++++++-
 mm/page_alloc.c                    |   2 +-
 mm/vmstat.c                        |  10 +-
 12 files changed, 595 insertions(+), 37 deletions(-)
 create mode 100644 include/linux/balloon_compaction.h
 create mode 100644 mm/balloon_compaction.c


Change log:
v8:
 * introduce a common MM interface for balloon driver page compaction (Michael);
 * remove the global state preventing multiple balloon device support (Michael);
 * introduce RCU protection/synchronization to balloon page->mapping (Michael);
v7:
 * fix a potential page leak case at 'putback_balloon_page' (Mel);
 * adjust vm-events-counter patch and remove its drop-on-merge message (Rik);
 * add 'putback_movable_pages' to avoid hacks on 'putback_lru_pages' (Minchan);
v6:
 * rename 'is_balloon_page()' to 'movable_balloon_page()' (Rik);
v5:
 * address Andrew Morton's review comments on the patch series;
 * address a couple extra nitpick suggestions on PATCH 01 (Minchan);
v4: 
 * address Rusty Russell's review comments on PATCH 02;
 * re-base virtio_balloon patch on 9c378abc5c0c6fc8e3acf5968924d274503819b3;
V3: 
 * address reviewers' nitpick suggestions on PATCH 01 (Mel, Minchan);
V2: 
 * address Mel Gorman's review comments on PATCH 01;


Preliminary test results:
(2 VCPU, 2048MB RAM KVM guest running 3.6.0_rc2+ -- after a reboot)

* 64MB balloon:
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 0
compact_pages_moved 0
compact_pagemigrate_failed 0
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_isolated 0
compact_balloon_migrated 0
compact_balloon_returned 0
compact_balloon_released 0
[root@localhost ~]# 
[root@localhost ~]# for i in $(seq 1 6); do echo 1 > /proc/sys/vm/compact_memory & done &>/dev/null 
[1]   Done                    echo 1 > /proc/sys/vm/compact_memory
[2]   Done                    echo 1 > /proc/sys/vm/compact_memory
[3]   Done                    echo 1 > /proc/sys/vm/compact_memory
[4]   Done                    echo 1 > /proc/sys/vm/compact_memory
[5]-  Done                    echo 1 > /proc/sys/vm/compact_memory
[6]+  Done                    echo 1 > /proc/sys/vm/compact_memory
[root@localhost ~]# 
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 3108
compact_pages_moved 43169
compact_pagemigrate_failed 95
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_isolated 16384
compact_balloon_migrated 16384
compact_balloon_returned 0
compact_balloon_released 16384


* 128MB balloon:
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 0
compact_pages_moved 0
compact_pagemigrate_failed 0
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_isolated 0
compact_balloon_migrated 0
compact_balloon_returned 0
compact_balloon_released 0
[root@localhost ~]# 
[root@localhost ~]# for i in $(seq 1 6); do echo 1 > /proc/sys/vm/compact_memory & done &>/dev/null  
[1]   Done                    echo 1 > /proc/sys/vm/compact_memory
[2]   Done                    echo 1 > /proc/sys/vm/compact_memory
[3]   Done                    echo 1 > /proc/sys/vm/compact_memory
[4]   Done                    echo 1 > /proc/sys/vm/compact_memory
[5]-  Done                    echo 1 > /proc/sys/vm/compact_memory
[6]+  Done                    echo 1 > /proc/sys/vm/compact_memory
[root@localhost ~]# 
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 3062
compact_pages_moved 49774
compact_pagemigrate_failed 129
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_isolated 26076
compact_balloon_migrated 25957
compact_balloon_returned 119
compact_balloon_released 25957

-- 
1.7.11.4


* [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 12:47 [PATCH v8 0/5] make balloon pages movable by compaction Rafael Aquini
@ 2012-08-21 12:47 ` Rafael Aquini
  2012-08-21 13:52   ` Michael S. Tsirkin
  2012-08-21 15:20   ` Peter Zijlstra
  2012-08-21 12:47 ` [PATCH v8 2/5] mm: introduce compaction and migration for ballooned pages Rafael Aquini
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 12:47 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, virtualization, Rusty Russell, Michael S. Tsirkin,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim, Rafael Aquini

Memory fragmentation introduced by ballooning might significantly reduce
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces a common interface to help a balloon driver make
its page set movable by compaction, thus allowing the system to better
leverage compaction efforts on memory defragmentation.
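
For illustration, here is a minimal, hedged sketch of how a balloon driver is
expected to wire this interface up. The my_* names are hypothetical
placeholders for this sketch only (patch 3 shows the real virtio_balloon
usage), and the callback bodies are omitted:

#include <linux/rculist.h>
#include <linux/balloon_compaction.h>

/* hypothetical per-device state, for this sketch only */
struct my_balloon {
	struct address_space *mapping;	/* special balloon page->mapping */
	struct list_head pages;		/* pages handed to the hypervisor */
	spinlock_t pages_lock;		/* protects the 'pages' list */
};

/* driver-provided callbacks; prototypes only, bodies omitted here */
static int my_migratepage(struct address_space *mapping, struct page *newpage,
			  struct page *page, enum migrate_mode mode);
static int my_isolatepage(struct page *page);
static void my_putbackpage(struct page *page);

/* instantiate the a_ops descriptor used as balloon page->mapping->a_ops */
static DEFINE_BALLOON_MAPPING_AOPS(my_balloon_aops, my_migratepage,
				   my_isolatepage, my_putbackpage);

static int my_balloon_init(struct my_balloon *b)
{
	INIT_LIST_HEAD(&b->pages);
	spin_lock_init(&b->pages_lock);
	/* one special ->mapping per balloon device instance */
	b->mapping = alloc_balloon_mapping(b, &my_balloon_aops);
	return b->mapping ? 0 : -ENOMEM;
}

static void my_balloon_inflate_one(struct my_balloon *b)
{
	/* the mapping's gfp mask resolves to GFP_HIGHUSER_MOVABLE here */
	struct page *page = alloc_page(mapping_gfp_mask(b->mapping) |
				       __GFP_NORETRY | __GFP_NOWARN);
	if (!page)
		return;
	spin_lock(&b->pages_lock);
	list_add_rcu(&page->lru, &b->pages);
	/* tag the page so compaction can spot it via mapping_balloon() */
	assign_balloon_mapping(page, b->mapping);
	spin_unlock(&b->pages_lock);
}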

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 include/linux/balloon_compaction.h | 113 +++++++++++++++++++++++++++++
 include/linux/pagemap.h            |  18 +++++
 mm/Kconfig                         |  15 ++++
 mm/Makefile                        |   2 +-
 mm/balloon_compaction.c            | 145 +++++++++++++++++++++++++++++++++++++
 5 files changed, 292 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/balloon_compaction.h
 create mode 100644 mm/balloon_compaction.c

diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
new file mode 100644
index 0000000..5fbf036
--- /dev/null
+++ b/include/linux/balloon_compaction.h
@@ -0,0 +1,113 @@
+/*
+ * include/linux/balloon_compaction.h
+ *
+ * Common interface definitions for making balloon pages movable to compaction.
+ *
+ * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini <aquini@redhat.com>
+ */
+#ifndef _LINUX_BALLOON_COMPACTION_H
+#define _LINUX_BALLOON_COMPACTION_H
+#ifdef __KERNEL__
+
+#include <linux/rcupdate.h>
+#include <linux/pagemap.h>
+#include <linux/gfp.h>
+
+#if defined(CONFIG_BALLOON_COMPACTION)
+extern bool isolate_balloon_page(struct page *);
+extern void putback_balloon_page(struct page *);
+
+static inline bool movable_balloon_page(struct page *page)
+{
+	struct address_space *mapping = NULL;
+	bool ret = false;
+	/*
+	 * Before dereferencing and testing mapping->flags, lets make sure
+	 * this is not a page that uses ->mapping in a different way
+	 */
+	if (!PageSlab(page) && !PageSwapCache(page) &&
+	    !PageAnon(page) && !page_mapped(page)) {
+		rcu_read_lock();
+		mapping = rcu_dereference(page->mapping);
+		if (mapping_balloon(mapping))
+			ret = true;
+		rcu_read_unlock();
+	}
+	return ret;
+}
+
+static inline gfp_t balloon_mapping_gfp_mask(void)
+{
+	return GFP_HIGHUSER_MOVABLE;
+}
+
+/*
+ * __page_balloon_device - return the balloon device owning the page.
+ *
+ * This shall only be used at driver callbacks under proper page_lock,
+ * to get access to the balloon device structure that owns @page.
+ */
+static inline void *__page_balloon_device(struct page *page)
+{
+	struct address_space *mapping;
+	mapping = rcu_access_pointer(page->mapping);
+	if (mapping)
+		mapping = mapping->assoc_mapping;
+	return (void *)mapping;
+}
+
+#define count_balloon_event(e) count_vm_event(e)
+/*
+ * DEFINE_BALLOON_MAPPING_AOPS - declare and instantiate a callback descriptor
+ *				 to be used as balloon page->mapping->a_ops.
+ *
+ * @label     : declaration identifier (var name)
+ * @migratepg : callback symbol name for performing the page migration step
+ * @isolatepg : callback symbol name for performing the page isolation step
+ * @putbackpg : callback symbol name for performing the page putback step
+ *
+ * address_space_operations utilized methods for ballooned pages:
+ *   .migratepage    - used to perform balloon's page migration (as is)
+ *   .launder_page   - used to isolate a page from balloon's page list
+ *   .freepage       - used to reinsert an isolated page to balloon's page list
+ */
+#define DEFINE_BALLOON_MAPPING_AOPS(label, migratepg, isolatepg, putbackpg) \
+	const struct address_space_operations (label) = {		    \
+		.migratepage  = (migratepg),				    \
+		.launder_page = (isolatepg),				    \
+		.freepage     = (putbackpg),				    \
+	}
+
+#else
+static inline bool isolate_balloon_page(struct page *page) { return false; }
+static inline bool movable_balloon_page(struct page *page) { return false; }
+static inline void putback_balloon_page(struct page *page) { return; }
+
+static inline gfp_t balloon_mapping_gfp_mask(void)
+{
+	return GFP_HIGHUSER;
+}
+
+#define count_balloon_event(e) {}
+#define DEFINE_BALLOON_MAPPING_AOPS(label, migratepg, isolatepg, putbackpg) \
+	const struct address_space_operations *(label) = NULL
+
+#endif /* CONFIG_BALLOON_COMPACTION */
+
+extern struct address_space *alloc_balloon_mapping(void *balloon_device,
+				const struct address_space_operations *a_ops);
+extern void *page_balloon_device(struct page *page);
+
+static inline void assign_balloon_mapping(struct page *page,
+					  struct address_space *mapping)
+{
+	rcu_assign_pointer(page->mapping, mapping);
+}
+
+static inline void clear_balloon_mapping(struct page *page)
+{
+	rcu_assign_pointer(page->mapping, NULL);
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_BALLOON_COMPACTION_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e42c762..6df0664 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -53,6 +54,23 @@ static inline int mapping_unevictable(struct address_space *mapping)
 	return !!mapping;
 }
 
+static inline void mapping_set_balloon(struct address_space *mapping)
+{
+	set_bit(AS_BALLOON_MAP, &mapping->flags);
+}
+
+static inline void mapping_clear_balloon(struct address_space *mapping)
+{
+	clear_bit(AS_BALLOON_MAP, &mapping->flags);
+}
+
+static inline int mapping_balloon(struct address_space *mapping)
+{
+	if (mapping)
+		return test_bit(AS_BALLOON_MAP, &mapping->flags);
+	return !!mapping;
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
diff --git a/mm/Kconfig b/mm/Kconfig
index d5c8019..4857899 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -188,6 +188,21 @@ config SPLIT_PTLOCK_CPUS
 	default "4"
 
 #
+# support for memory balloon compaction
+config BALLOON_COMPACTION
+	bool "Allow for balloon memory compaction/migration"
+	select COMPACTION
+	depends on VIRTIO_BALLOON
+	help
+	  Memory fragmentation introduced by ballooning might significantly
+	  reduce the number of 2MB contiguous memory blocks that can be
+	  used within a guest, thus imposing performance penalties associated
+	  with the reduced number of transparent huge pages that could be used
+	  by the guest workload. Allowing the compaction & migration for memory
+	  pages enlisted as being part of memory balloon devices avoids the
+	  aforementioned scenario and helps improve memory defragmentation.
+
+#
 # support for memory compaction
 config COMPACTION
 	bool "Allow for memory compaction"
diff --git a/mm/Makefile b/mm/Makefile
index 92753e2..78d8caa 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o $(mmu-y)
+			   compaction.o balloon_compaction.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
new file mode 100644
index 0000000..d79f13d
--- /dev/null
+++ b/mm/balloon_compaction.c
@@ -0,0 +1,145 @@
+/*
+ * mm/balloon_compaction.c
+ *
+ * Common interface for making balloon pages movable to compaction.
+ *
+ * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini <aquini@redhat.com>
+ */
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/balloon_compaction.h>
+
+/*
+ * alloc_balloon_mapping - allocates a special ->mapping for ballooned pages.
+ * @balloon_device: pointer address that references the balloon device which
+ *                 owns pages bearing this ->mapping.
+ * @a_ops: balloon_mapping address_space_operations descriptor.
+ *
+ * Users must call it to properly allocate and initialize an instance of
+ * struct address_space which will be used as the special page->mapping for
+ * page instances enlisted by balloon devices.
+ */
+struct address_space *alloc_balloon_mapping(void *balloon_device,
+				const struct address_space_operations *a_ops)
+{
+	struct address_space *mapping;
+
+	mapping = kmalloc(sizeof(*mapping), GFP_KERNEL);
+	if (mapping) {
+		/*
+		 * Give a clean 'zeroed' status to all elements of this special
+		 * balloon page->mapping struct address_space instance.
+		 */
+		address_space_init_once(mapping);
+
+		/*
+		 * Set mapping->flags appropriately, to allow balloon ->mapping
+		 * identification, as well as give a proper hint to the balloon
+		 * driver on what GFP allocation mask shall be used.
+		 */
+		mapping_set_balloon(mapping);
+		mapping_set_gfp_mask(mapping, balloon_mapping_gfp_mask());
+
+		/* balloon's page->mapping->a_ops callback descriptor */
+		mapping->a_ops = a_ops;
+
+		/*
+		 * balloon special page->mapping overloads ->assoc_mapping
+		 * to hold a reference back to the balloon device which 'owns'
+		 * a given page. This is the way we can cope with multiple
+		 * balloon devices without losing reference of several
+		 * ballooned pagesets.
+		 */
+		mapping->assoc_mapping = (void *)balloon_device;
+
+		return mapping;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(alloc_balloon_mapping);
+
+#if defined(CONFIG_BALLOON_COMPACTION)
+static inline int __isolate_balloon_page(struct page *page)
+{
+	struct address_space *mapping;
+	int ret = 0;
+
+	rcu_read_lock();
+	mapping = rcu_dereference(page->mapping);
+	if (mapping)
+		ret = mapping->a_ops->launder_page(page);
+
+	rcu_read_unlock();
+	return ret;
+}
+
+static inline void __putback_balloon_page(struct page *page)
+{
+	struct address_space *mapping;
+
+	rcu_read_lock();
+	mapping = rcu_dereference(page->mapping);
+	/*
+	 * If we stumble across a NULL page->mapping here, something has
+	 * messed with the private isolated pageset we are iterating over.
+	 */
+	BUG_ON(!mapping);
+
+	mapping->a_ops->freepage(page);
+	/* isolation bumps up the page refcount, time to decrement it */
+	put_page(page);
+	rcu_read_unlock();
+}
+
+/* __isolate_lru_page() counterpart for a ballooned page */
+bool isolate_balloon_page(struct page *page)
+{
+	if (likely(get_page_unless_zero(page))) {
+		/*
+		 * As balloon pages are not isolated from LRU lists, concurrent
+		 * compaction threads can race against page migration functions
+		 * move_to_new_page() & __unmap_and_move().
+		 * In order to avoid having an already isolated balloon page
+		 * being (wrongly) re-isolated while it is under migration,
+		 * lets be sure we have the page lock before proceeding with
+		 * the balloon page isolation steps.
+		 */
+		if (likely(trylock_page(page))) {
+			/*
+			 * A ballooned page, by default, has just one refcount.
+			 * Prevent concurrent compaction threads from isolating
+			 * an already isolated balloon page by refcount check.
+			 */
+			if (movable_balloon_page(page) &&
+			    (page_count(page) == 2)) {
+				if (__isolate_balloon_page(page)) {
+					unlock_page(page);
+					return true;
+				}
+			}
+			unlock_page(page);
+		}
+		/*
+		 * This page is either under migration, it is isolated already,
+		 * or its isolation step has not happened at this round.
+		 * Drop the refcount taken for it.
+		 */
+		put_page(page);
+	}
+	return false;
+}
+
+/* putback_lru_page() counterpart for a ballooned page */
+void putback_balloon_page(struct page *page)
+{
+	/*
+	 * 'lock_page()' stabilizes the page and prevents races against
+	 * concurrent isolation threads attempting to re-isolate it.
+	 */
+	lock_page(page);
+	if (movable_balloon_page(page))
+		__putback_balloon_page(page);
+
+	unlock_page(page);
+}
+#endif /* CONFIG_BALLOON_COMPACTION */
-- 
1.7.11.4



* [PATCH v8 2/5] mm: introduce compaction and migration for ballooned pages
  2012-08-21 12:47 [PATCH v8 0/5] make balloon pages movable by compaction Rafael Aquini
  2012-08-21 12:47 ` [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility Rafael Aquini
@ 2012-08-21 12:47 ` Rafael Aquini
  2012-08-21 12:47 ` [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages Rafael Aquini
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 12:47 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, virtualization, Rusty Russell, Michael S. Tsirkin,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim, Rafael Aquini

Memory fragmentation introduced by ballooning might significantly reduce
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces the helper functions as well as the necessary changes
to teach compaction and migration bits how to cope with pages which are
part of a guest memory balloon, in order to make them movable by memory
compaction procedures.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 mm/compaction.c | 47 ++++++++++++++++++++++++++++-------------------
 mm/migrate.c    | 31 ++++++++++++++++++++++++++++++-
 2 files changed, 58 insertions(+), 20 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e78cb96..ce43dc2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -14,6 +14,7 @@
 #include <linux/backing-dev.h>
 #include <linux/sysctl.h>
 #include <linux/sysfs.h>
+#include <linux/balloon_compaction.h>
 #include "internal.h"
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -312,32 +313,40 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			continue;
 		}
 
-		if (!PageLRU(page))
-			continue;
-
 		/*
-		 * PageLRU is set, and lru_lock excludes isolation,
-		 * splitting and collapsing (collapsing has already
-		 * happened if PageLRU is set).
+		 * It is possible to migrate LRU pages and balloon pages.
+		 * Skip any other type of page.
 		 */
-		if (PageTransHuge(page)) {
-			low_pfn += (1 << compound_order(page)) - 1;
-			continue;
-		}
+		if (PageLRU(page)) {
+			/*
+			 * PageLRU is set, and lru_lock excludes isolation,
+			 * splitting and collapsing (collapsing has already
+			 * happened if PageLRU is set).
+			 */
+			if (PageTransHuge(page)) {
+				low_pfn += (1 << compound_order(page)) - 1;
+				continue;
+			}
 
-		if (!cc->sync)
-			mode |= ISOLATE_ASYNC_MIGRATE;
+			if (!cc->sync)
+				mode |= ISOLATE_ASYNC_MIGRATE;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+			lruvec = mem_cgroup_page_lruvec(page, zone);
 
-		/* Try isolate the page */
-		if (__isolate_lru_page(page, mode) != 0)
-			continue;
+			/* Try isolate the page */
+			if (__isolate_lru_page(page, mode) != 0)
+				continue;
 
-		VM_BUG_ON(PageTransCompound(page));
+			VM_BUG_ON(PageTransCompound(page));
+
+			/* Successfully isolated */
+			del_page_from_lru_list(page, lruvec, page_lru(page));
+		} else if (unlikely(movable_balloon_page(page))) {
+			if (!isolate_balloon_page(page))
+				continue;
+		} else
+			continue;
 
-		/* Successfully isolated */
-		del_page_from_lru_list(page, lruvec, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
 		nr_isolated++;
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..6392da258 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -35,6 +35,7 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
+#include <linux/balloon_compaction.h>
 
 #include <asm/tlbflush.h>
 
@@ -79,7 +80,10 @@ void putback_lru_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		putback_lru_page(page);
+		if (unlikely(movable_balloon_page(page)))
+			putback_balloon_page(page);
+		else
+			putback_lru_page(page);
 	}
 }
 
@@ -778,6 +782,17 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		}
 	}
 
+	if (unlikely(movable_balloon_page(page))) {
+		/*
+		 * A ballooned page does not need any special attention from
+		 * physical to virtual reverse mapping procedures.
+		 * Skip any attempt to unmap PTEs or to remap swap cache,
+		 * in order to avoid burning cycles at rmap level.
+		 */
+		remap_swapcache = 0;
+		goto skip_unmap;
+	}
+
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -846,6 +861,20 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 			goto out;
 
 	rc = __unmap_and_move(page, newpage, force, offlining, mode);
+
+	if (unlikely(movable_balloon_page(newpage))) {
+		/*
+		 * A ballooned page has been migrated already. Now, it is the
+		 * time to wrap-up counters, handle the old page back to Buddy
+		 * and return.
+		 */
+		list_del(&page->lru);
+		dec_zone_page_state(page, NR_ISOLATED_ANON +
+				    page_is_file_cache(page));
+		put_page(page);
+		__free_page(page);
+		return rc;
+	}
 out:
 	if (rc != -EAGAIN) {
 		/*
-- 
1.7.11.4



* [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages
  2012-08-21 12:47 [PATCH v8 0/5] make balloon pages movable by compaction Rafael Aquini
  2012-08-21 12:47 ` [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility Rafael Aquini
  2012-08-21 12:47 ` [PATCH v8 2/5] mm: introduce compaction and migration for ballooned pages Rafael Aquini
@ 2012-08-21 12:47 ` Rafael Aquini
  2012-08-21 14:40   ` Michael S. Tsirkin
  2012-08-21 14:57   ` Michael S. Tsirkin
  2012-08-21 12:47 ` [PATCH v8 4/5] mm: introduce putback_movable_pages() Rafael Aquini
  2012-08-21 12:47 ` [PATCH v8 5/5] mm: add vm event counters for balloon pages compaction Rafael Aquini
  4 siblings, 2 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 12:47 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, virtualization, Rusty Russell, Michael S. Tsirkin,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim, Rafael Aquini

Memory fragmentation introduced by ballooning might significantly reduce
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

Besides making balloon pages movable at allocation time and introducing
the necessary primitives to perform balloon page migration/compaction,
this patch also introduces the following locking scheme to provide the
proper synchronization and protection for struct virtio_balloon elements
against concurrent accesses due to parallel operations introduced by
memory compaction / page migration (see the lock-nesting sketch below):
 - balloon_lock (mutex) : synchronizes access to struct virtio_balloon
			  elements and its queue operations;
 - pages_lock (spinlock): protects the balloon 'pages' list against
			  concurrent list handling operations;
 - virtio_balloon->pages list handling is synchronized by RCU operations.
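
A condensed view of how these nest on the inflate path, trimmed from
fill_balloon() in the patch below (illustrative excerpt only, error
handling elided):

	mutex_lock(&vb->balloon_lock);		/* vb fields + virtqueue ops */
	...
	spin_lock(&vb->pages_lock);		/* 'pages' list handling */
	list_add_rcu(&page->lru, &vb->pages);
	assign_balloon_mapping(page, vb->mapping);
	spin_unlock(&vb->pages_lock);
	...
	tell_host(vb, vb->inflate_vq);
	mutex_unlock(&vb->balloon_lock);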

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 210 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 199 insertions(+), 11 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0908e60..bda7bb0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,6 +27,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/balloon_compaction.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -35,6 +36,12 @@
  */
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 
+/* flags used to hint compaction procedures about the balloon device status */
+enum balloon_status_flags {
+	BALLOON_REMOVAL = 0,	/* balloon device is under removal steps */
+	BALLOON_OK,		/* balloon device is up and running */
+};
+
 struct virtio_balloon
 {
 	struct virtio_device *vdev;
@@ -46,11 +53,24 @@ struct virtio_balloon
 	/* The thread servicing the balloon. */
 	struct task_struct *thread;
 
+	/* balloon special page->mapping */
+	struct address_space *mapping;
+
+	/* Synchronize access/update to this struct virtio_balloon elements */
+	struct mutex balloon_lock;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
+
+	/* balloon device status flag */
+	unsigned short balloon_status;
+
+	/* Protect 'pages' list against concurrent handling */
+	spinlock_t pages_lock;
+
 	/*
 	 * The pages we've told the Host we're not using.
 	 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
@@ -122,13 +142,17 @@ static void set_page_pfns(u32 pfns[], struct page *page)
 
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
+	/* Get the proper GFP alloc mask from vb->mapping flags */
+	gfp_t vb_gfp_mask = mapping_gfp_mask(vb->mapping);
+
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
+	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-					__GFP_NOMEMALLOC | __GFP_NOWARN);
+		struct page *page = alloc_page(vb_gfp_mask | __GFP_NORETRY |
+					       __GFP_NOWARN | __GFP_NOMEMALLOC);
 		if (!page) {
 			if (printk_ratelimit())
 				dev_printk(KERN_INFO, &vb->vdev->dev,
@@ -141,7 +165,10 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 		set_page_pfns(vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		totalram_pages--;
-		list_add(&page->lru, &vb->pages);
+		spin_lock(&vb->pages_lock);
+		list_add_rcu(&page->lru, &vb->pages);
+		assign_balloon_mapping(page, vb->mapping);
+		spin_unlock(&vb->pages_lock);
 	}
 
 	/* Didn't get any?  Oh well. */
@@ -149,6 +176,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 		return;
 
 	tell_host(vb, vb->inflate_vq);
+	mutex_unlock(&vb->balloon_lock);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -169,21 +197,48 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
+	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = list_first_entry(&vb->pages, struct page, lru);
-		list_del(&page->lru);
+		/*
+		 * We can race against 'virtballoon_isolatepage()' and end up
+		 * stumbling across a _temporarily_ empty 'pages' list.
+		 */
+		spin_lock(&vb->pages_lock);
+		page = list_first_or_null_rcu(&vb->pages, struct page, lru);
+		if (!page) {
+			spin_unlock(&vb->pages_lock);
+			break;
+		}
+		/*
+		 * It is safe now to drop page->mapping and delete this page
+		 * from balloon page list, since we are grabbing 'pages_lock'
+		 * which prevents 'virtballoon_isolatepage()' from acting.
+		 */
+		clear_balloon_mapping(page);
+		list_del_rcu(&page->lru);
+		spin_unlock(&vb->pages_lock);
 		set_page_pfns(vb->pfns + vb->num_pfns, page);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
+	/*
+	 * Synchronize RCU grace period and wait for all RCU read-side critical
+	 * sections to finish before proceeding with page release steps.
+	 * This avoids compaction/migration callback races against balloon
+	 * device removal steps.
+	 */
+	synchronize_rcu();
 
 	/*
 	 * Note that if
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	tell_host(vb, vb->deflate_vq);
-	release_pages_by_pfn(vb->pfns, vb->num_pfns);
+	if (vb->num_pfns > 0) {
+		tell_host(vb, vb->deflate_vq);
+		release_pages_by_pfn(vb->pfns, vb->num_pfns);
+	}
+	mutex_unlock(&vb->balloon_lock);
 }
 
 static inline void update_stat(struct virtio_balloon *vb, int idx,
@@ -239,6 +294,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	struct scatterlist sg;
 	unsigned int len;
 
+	mutex_lock(&vb->balloon_lock);
 	vb->need_stats_update = 0;
 	update_balloon_stats(vb);
 
@@ -249,6 +305,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
+	mutex_unlock(&vb->balloon_lock);
 }
 
 static void virtballoon_changed(struct virtio_device *vdev)
@@ -261,22 +318,27 @@ static void virtballoon_changed(struct virtio_device *vdev)
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	__le32 v;
-	s64 target;
+	s64 target, actual;
 
+	mutex_lock(&vb->balloon_lock);
+	actual = vb->num_pages;
 	vb->vdev->config->get(vb->vdev,
 			      offsetof(struct virtio_balloon_config, num_pages),
 			      &v, sizeof(v));
 	target = le32_to_cpu(v);
-	return target - vb->num_pages;
+	mutex_unlock(&vb->balloon_lock);
+	return target - actual;
 }
 
 static void update_balloon_size(struct virtio_balloon *vb)
 {
-	__le32 actual = cpu_to_le32(vb->num_pages);
-
+	__le32 actual;
+	mutex_lock(&vb->balloon_lock);
+	actual = cpu_to_le32(vb->num_pages);
 	vb->vdev->config->set(vb->vdev,
 			      offsetof(struct virtio_balloon_config, actual),
 			      &actual, sizeof(actual));
+	mutex_unlock(&vb->balloon_lock);
 }
 
 static int balloon(void *_vballoon)
@@ -339,9 +401,117 @@ static int init_vqs(struct virtio_balloon *vb)
 	return 0;
 }
 
+#ifdef CONFIG_BALLOON_COMPACTION
+/*
+ * Populate balloon_mapping->a_ops callback method to perform the balloon
+ * page migration task.
+ *
+ * After a ballooned page gets isolated by compaction procedures, this is the
+ * function that performs the page migration on behalf of move_to_new_page(),
+ * when the latter calls (page)->mapping->a_ops->migratepage.
+ *
+ * Page migration for virtio balloon is done in a simple swap fashion which
+ * follows these two steps:
+ *  1) insert newpage into vb->pages list and update the host about it;
+ *  2) update the host about the removed old page from vb->pages list;
+ */
+int virtballoon_migratepage(struct address_space *mapping,
+		struct page *newpage, struct page *page, enum migrate_mode mode)
+{
+	struct virtio_balloon *vb = (void *)__page_balloon_device(page);
+
+	/* a NULL *vb here, while very unlikely, is a serious bug */
+	BUG_ON(!vb);
+
+	/*
+	 * Skip page migration step if the memory balloon device is under its
+	 * removal procedure, to avoid racing against module unload.
+	 *
+	 * If there still are isolated (balloon) pages under migration lists,
+	 * 'virtballoon_putbackpage()' will take care of them properly, before
+	 * the module unload finishes.
+	 */
+	if (vb->balloon_status == BALLOON_REMOVAL)
+		return -EAGAIN;
+
+	mutex_lock(&vb->balloon_lock);
+
+	/* balloon's page migration 1st step */
+	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+	spin_lock(&vb->pages_lock);
+	list_add_rcu(&newpage->lru, &vb->pages);
+	spin_unlock(&vb->pages_lock);
+	set_page_pfns(vb->pfns, newpage);
+	tell_host(vb, vb->inflate_vq);
+
+	/* balloon's page migration 2nd step */
+	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+	set_page_pfns(vb->pfns, page);
+	tell_host(vb, vb->deflate_vq);
+
+	mutex_unlock(&vb->balloon_lock);
+	return 0;
+}
+
+/*
+ * Populate balloon_mapping->a_ops callback method to help compaction on
+ * isolating a page from the balloon page list for posterior migration.
+ */
+int virtballoon_isolatepage(struct page *page)
+{
+	struct virtio_balloon *vb = (void *)__page_balloon_device(page);
+	int ret = 0;
+	/*
+	 * If we stumble across a NULL *vb here, it means this page has already
+	 * been released by 'leak_balloon()'.
+	 *
+	 * We also skip the page isolation step if the memory balloon device is
+	 * under its removal procedure, to avoid racing against module unload.
+	 */
+	if (vb && (vb->balloon_status != BALLOON_REMOVAL)) {
+		spin_lock(&vb->pages_lock);
+		/*
+		 * virtballoon_isolatepage() can race against leak_balloon(),
+		 * and (wrongly) isolate a page that is about to be freed.
+		 * Test page->mapping under pages_lock to close that window.
+		 */
+		if (rcu_access_pointer(page->mapping) == vb->mapping) {
+			/* It is safe to isolate this page, now */
+			list_del_rcu(&page->lru);
+			ret = 1;
+		}
+		spin_unlock(&vb->pages_lock);
+	}
+	return ret;
+}
+
+/*
+ * Populate balloon_mapping->a_ops callback method to help compaction on
+ * re-inserting a not-migrated isolated page into the balloon page list.
+ */
+void virtballoon_putbackpage(struct page *page)
+{
+	struct virtio_balloon *vb = (void *)__page_balloon_device(page);
+
+	/* a NULL *vb here, while very unlikely, is a serious bug */
+	BUG_ON(!vb);
+
+	spin_lock(&vb->pages_lock);
+	list_add_rcu(&page->lru, &vb->pages);
+	spin_unlock(&vb->pages_lock);
+}
+#endif /* CONFIG_BALLOON_COMPACTION */
+
+/* define the balloon_mapping->a_ops callbacks to allow compaction/migration */
+static DEFINE_BALLOON_MAPPING_AOPS(virtio_balloon_aops,
+				   virtballoon_migratepage,
+				   virtballoon_isolatepage,
+				   virtballoon_putbackpage);
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
+	struct address_space *vb_mapping;
 	int err;
 
 	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
@@ -351,12 +521,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	}
 
 	INIT_LIST_HEAD(&vb->pages);
+	spin_lock_init(&vb->pages_lock);
+	mutex_init(&vb->balloon_lock);
+
 	vb->num_pages = 0;
 	init_waitqueue_head(&vb->config_change);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
 	vb->need_stats_update = 0;
 
+	/* Allocate a special page->mapping for this balloon device */
+	vb_mapping = alloc_balloon_mapping((void *)vb, &virtio_balloon_aops);
+	if (!vb_mapping) {
+		err = -ENOMEM;
+		goto out_free_vb;
+	}
+	/* Store the page->mapping reference for this balloon device */
+	vb->mapping = vb_mapping;
+
 	err = init_vqs(vb);
 	if (err)
 		goto out_free_vb;
@@ -367,12 +549,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		goto out_del_vqs;
 	}
 
+	vb->balloon_status = BALLOON_OK;
 	return 0;
 
 out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
+	kfree(vb_mapping);
 out:
 	return err;
 }
@@ -394,8 +578,10 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb = vdev->priv;
 
+	vb->balloon_status = BALLOON_REMOVAL;
 	kthread_stop(vb->thread);
 	remove_common(vb);
+	kfree(vb->mapping);
 	kfree(vb);
 }
 
@@ -409,6 +595,7 @@ static int virtballoon_freeze(struct virtio_device *vdev)
 	 * function is called.
 	 */
 
+	vb->balloon_status = BALLOON_REMOVAL;
 	remove_common(vb);
 	return 0;
 }
@@ -424,6 +611,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 	fill_balloon(vb, towards_target(vb));
 	update_balloon_size(vb);
+	vb->balloon_status = BALLOON_OK;
 	return 0;
 }
 #endif
-- 
1.7.11.4



* [PATCH v8 4/5] mm: introduce putback_movable_pages()
  2012-08-21 12:47 [PATCH v8 0/5] make balloon pages movable by compaction Rafael Aquini
                   ` (2 preceding siblings ...)
  2012-08-21 12:47 ` [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages Rafael Aquini
@ 2012-08-21 12:47 ` Rafael Aquini
  2012-08-21 14:42   ` Michael S. Tsirkin
  2012-08-21 12:47 ` [PATCH v8 5/5] mm: add vm event counters for balloon pages compaction Rafael Aquini
  4 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 12:47 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, virtualization, Rusty Russell, Michael S. Tsirkin,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim, Rafael Aquini

The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted on the balloon page list as if a ballooned page were an LRU page.

As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains both ballooned pages and LRU pages, thus fixing the
aforementioned inelegant hack around putback_lru_pages().
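
For callers, the switch is mechanical: wherever the isolated list was built
by isolate_migratepages_range(), the error path now uses the new helper, as
in this condensed fragment of the mm/compaction.c hunk below:

	/* cc->migratepages may hold balloon pages alongside LRU pages */
	if (err) {
		putback_movable_pages(&cc->migratepages);
		cc->nr_migratepages = 0;
	}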

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 include/linux/migrate.h |  2 ++
 mm/compaction.c         |  4 ++--
 mm/migrate.c            | 20 ++++++++++++++++++++
 mm/page_alloc.c         |  2 +-
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..ff103a1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -10,6 +10,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
+extern void putback_movable_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
@@ -33,6 +34,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
+static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index ce43dc2..782ed32 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -759,9 +759,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
 						nr_remaining);
 
-		/* Release LRU pages not migrated */
+		/* Release isolated pages not migrated */
 		if (err) {
-			putback_lru_pages(&cc->migratepages);
+			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
 			if (err == -ENOMEM) {
 				ret = COMPACT_PARTIAL;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6392da258..0bf2caf 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -80,6 +80,26 @@ void putback_lru_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * Put previously isolated pages back onto the appropriate lists
+ * from where they were once taken off for compaction/migration.
+ *
+ * This function shall be used instead of putback_lru_pages(),
+ * whenever the isolated pageset has been built by isolate_migratepages_range()
+ */
+void putback_movable_pages(struct list_head *l)
+{
+	struct page *page;
+	struct page *page2;
+
+	list_for_each_entry_safe(page, page2, l, lru) {
+		list_del(&page->lru);
+		dec_zone_page_state(page, NR_ISOLATED_ANON +
+				page_is_file_cache(page));
 		if (unlikely(movable_balloon_page(page)))
 			putback_balloon_page(page);
 		else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 009ac28..78b7663 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5669,7 +5669,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 				    0, false, MIGRATE_SYNC);
 	}
 
-	putback_lru_pages(&cc.migratepages);
+	putback_movable_pages(&cc.migratepages);
 	return ret > 0 ? 0 : ret;
 }
 
-- 
1.7.11.4



* [PATCH v8 5/5] mm: add vm event counters for balloon pages compaction
  2012-08-21 12:47 [PATCH v8 0/5] make balloon pages movable by compaction Rafael Aquini
                   ` (3 preceding siblings ...)
  2012-08-21 12:47 ` [PATCH v8 4/5] mm: introduce putback_movable_pages() Rafael Aquini
@ 2012-08-21 12:47 ` Rafael Aquini
  4 siblings, 0 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 12:47 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, virtualization, Rusty Russell, Michael S. Tsirkin,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim, Rafael Aquini

This patch introduces a new set of vm event counters to keep track of
balloon page compaction activity.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 drivers/virtio/virtio_balloon.c |  2 ++
 include/linux/vm_event_item.h   |  8 +++++++-
 mm/balloon_compaction.c         |  6 ++++--
 mm/migrate.c                    |  1 +
 mm/vmstat.c                     | 10 +++++++++-
 5 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bda7bb0..c358ed3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -449,6 +449,8 @@ int virtballoon_migratepage(struct address_space *mapping,
 	set_page_pfns(vb->pfns, page);
 	tell_host(vb, vb->deflate_vq);
 
+	/* account this successful page migration in the vm event counters */
+	count_balloon_event(COMPACTBALLOONMIGRATED);
 	mutex_unlock(&vb->balloon_lock);
 	return 0;
 }
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 57f7b10..6868aba 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -41,7 +41,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_COMPACTION
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
-#endif
+#ifdef CONFIG_BALLOON_COMPACTION
+		COMPACTBALLOONISOLATED, /* isolated from balloon pagelist */
+		COMPACTBALLOONMIGRATED, /* balloon page successfully migrated */
+		COMPACTBALLOONRETURNED, /* putback to pagelist, not-migrated */
+		COMPACTBALLOONRELEASED, /* old-page released after migration */
+#endif /* CONFIG_BALLOON_COMPACTION */
+#endif /* CONFIG_COMPACTION */
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index d79f13d..9186000 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -114,6 +114,7 @@ bool isolate_balloon_page(struct page *page)
 			    (page_count(page) == 2)) {
 				if (__isolate_balloon_page(page)) {
 					unlock_page(page);
+					count_vm_event(COMPACTBALLOONISOLATED);
 					return true;
 				}
 			}
@@ -137,9 +138,10 @@ void putback_balloon_page(struct page *page)
 	 * concurrent isolation threads attempting to re-isolate it.
 	 */
 	lock_page(page);
-	if (movable_balloon_page(page))
+	if (movable_balloon_page(page)) {
 		__putback_balloon_page(page);
-
+		count_vm_event(COMPACTBALLOONRETURNED);
+	}
 	unlock_page(page);
 }
 #endif /* CONFIG_BALLOON_COMPACTION */
diff --git a/mm/migrate.c b/mm/migrate.c
index 0bf2caf..052e59a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -893,6 +893,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 				    page_is_file_cache(page));
 		put_page(page);
 		__free_page(page);
+		count_balloon_event(COMPACTBALLOONRELEASED);
 		return rc;
 	}
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index df7a674..c7919c4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -768,7 +768,15 @@ const char * const vmstat_text[] = {
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-#endif
+
+#ifdef CONFIG_BALLOON_COMPACTION
+	"compact_balloon_isolated",
+	"compact_balloon_migrated",
+	"compact_balloon_returned",
+	"compact_balloon_released",
+#endif /* CONFIG_BALLOON_COMPACTION */
+
+#endif /* CONFIG_COMPACTION */
 
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
-- 
1.7.11.4



* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 12:47 ` [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility Rafael Aquini
@ 2012-08-21 13:52   ` Michael S. Tsirkin
  2012-08-21 14:25     ` Michael S. Tsirkin
                       ` (2 more replies)
  2012-08-21 15:20   ` Peter Zijlstra
  1 sibling, 3 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 13:52 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 09:47:44AM -0300, Rafael Aquini wrote:
> Memory fragmentation introduced by ballooning might significantly reduce
> the number of 2MB contiguous memory blocks that can be used within a guest,
> thus imposing performance penalties associated with the reduced number of
> transparent huge pages that could be used by the guest workload.
> 
> This patch introduces a common interface to help a balloon driver make
> its page set movable by compaction, thus allowing the system to better
> leverage compaction efforts on memory defragmentation.
> 
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  include/linux/balloon_compaction.h | 113 +++++++++++++++++++++++++++++
>  include/linux/pagemap.h            |  18 +++++
>  mm/Kconfig                         |  15 ++++
>  mm/Makefile                        |   2 +-
>  mm/balloon_compaction.c            | 145 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 292 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/balloon_compaction.h
>  create mode 100644 mm/balloon_compaction.c
> 
> diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
> new file mode 100644
> index 0000000..5fbf036
> --- /dev/null
> +++ b/include/linux/balloon_compaction.h
> @@ -0,0 +1,113 @@
> +/*
> + * include/linux/balloon_compaction.h
> + *
> + * Common interface definitions for making balloon pages movable to compaction.
> + *
> + * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini <aquini@redhat.com>
> + */
> +#ifndef _LINUX_BALLOON_COMPACTION_H
> +#define _LINUX_BALLOON_COMPACTION_H
> +#ifdef __KERNEL__
> +
> +#include <linux/rcupdate.h>
> +#include <linux/pagemap.h>
> +#include <linux/gfp.h>
> +
> +#if defined(CONFIG_BALLOON_COMPACTION)
> +extern bool isolate_balloon_page(struct page *);
> +extern void putback_balloon_page(struct page *);
> +
> +static inline bool movable_balloon_page(struct page *page)
> +{
> +	struct address_space *mapping = NULL;
> +	bool ret = false;
> +	/*
> +	 * Before dereferencing and testing mapping->flags, lets make sure
> +	 * this is not a page that uses ->mapping in a different way
> +	 */
> +	if (!PageSlab(page) && !PageSwapCache(page) &&
> +	    !PageAnon(page) && !page_mapped(page)) {
> +		rcu_read_lock();
> +		mapping = rcu_dereference(page->mapping);
> +		if (mapping_balloon(mapping))
> +			ret = true;
> +		rcu_read_unlock();

This looks suspicious: you drop the rcu read lock here,
so can't the page switch from balloon to non-balloon afterwards?

Even if correct, it's probably cleaner to just make the caller
invoke this within an rcu critical section.
You will notice that all callers immediately re-enter an
rcu critical section anyway.

Alternatively, I noted that accesses to page->mapping
seem protected by the page lock bit.
If true, we can rely on that instead of RCU; we just need
assign_balloon_mapping to do lock_page/unlock_page.

> +	}
> +	return ret;
> +}
> +
> +static inline gfp_t balloon_mapping_gfp_mask(void)
> +{
> +	return GFP_HIGHUSER_MOVABLE;
> +}
> +
> +/*
> + * __page_balloon_device - return the balloon device owning the page.
> + *
> + * This shall only be used at driver callbacks under proper page_lock,
> + * to get access to the balloon device structure that owns @page.
> + */
> +static inline void *__page_balloon_device(struct page *page)
> +{
> +	struct address_space *mapping;
> +	mapping = rcu_access_pointer(page->mapping);
> +	if (mapping)
> +		mapping = mapping->assoc_mapping;
> +	return (void *)mapping;
> +}
> +
> +#define count_balloon_event(e) count_vm_event(e)
> +/*
> + * DEFINE_BALLOON_MAPPING_AOPS - declare and instantiate a callback descriptor
> + *				 to be used as balloon page->mapping->a_ops.
> + *
> + * @label     : declaration identifier (var name)
> + * @migratepg : callback symbol name for performing the page migration step
> + * @isolatepg : callback symbol name for performing the page isolation step
> + * @putbackpg : callback symbol name for performing the page putback step
> + *
> + * address_space_operations utilized methods for ballooned pages:
> + *   .migratepage    - used to perform balloon's page migration (as is)
> + *   .launder_page   - used to isolate a page from balloon's page list
> + *   .freepage       - used to reinsert an isolated page to balloon's page list
> + */

It would be a good idea to document the assumptions here.
Looks like .launder_page and .freepage are called in an rcu critical
section.
But .migratepage isn't - why is that safe?

> +#define DEFINE_BALLOON_MAPPING_AOPS(label, migratepg, isolatepg, putbackpg) \
> +	const struct address_space_operations (label) = {		    \
> +		.migratepage  = (migratepg),				    \
> +		.launder_page = (isolatepg),				    \
> +		.freepage     = (putbackpg),				    \
> +	}
> +
> +#else
> +static inline bool isolate_balloon_page(struct page *page) { return false; }
> +static inline bool movable_balloon_page(struct page *page) { return false; }
> +static inline void putback_balloon_page(struct page *page) { return; }
> +
> +static inline gfp_t balloon_mapping_gfp_mask(void)
> +{
> +	return GFP_HIGHUSER;
> +}
> +
> +#define count_balloon_event(e) {}
> +#define DEFINE_BALLOON_MAPPING_AOPS(label, migratepg, isolatepg, putbackpg) \
> +	const struct address_space_operations *(label) = NULL
> +
> +#endif /* CONFIG_BALLOON_COMPACTION */
> +
> +extern struct address_space *alloc_balloon_mapping(void *balloon_device,
> +				const struct address_space_operations *a_ops);
> +extern void *page_balloon_device(struct page *page);
> +
> +static inline void assign_balloon_mapping(struct page *page,
> +					  struct address_space *mapping)
> +{
> +	rcu_assign_pointer(page->mapping, mapping);
> +}
> +
> +static inline void clear_balloon_mapping(struct page *page)
> +{
> +	rcu_assign_pointer(page->mapping, NULL);
> +}
> +
> +#endif /* __KERNEL__ */
> +#endif /* _LINUX_BALLOON_COMPACTION_H */
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e42c762..6df0664 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -24,6 +24,7 @@ enum mapping_flags {
>  	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
>  	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>  	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
> +	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
>  };
>  
>  static inline void mapping_set_error(struct address_space *mapping, int error)
> @@ -53,6 +54,23 @@ static inline int mapping_unevictable(struct address_space *mapping)
>  	return !!mapping;
>  }
>  
> +static inline void mapping_set_balloon(struct address_space *mapping)
> +{
> +	set_bit(AS_BALLOON_MAP, &mapping->flags);
> +}
> +
> +static inline void mapping_clear_balloon(struct address_space *mapping)
> +{
> +	clear_bit(AS_BALLOON_MAP, &mapping->flags);
> +}
> +
> +static inline int mapping_balloon(struct address_space *mapping)
> +{
> +	if (mapping)
> +		return test_bit(AS_BALLOON_MAP, &mapping->flags);
> +	return !!mapping;
> +}
> +
>  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
>  {
>  	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d5c8019..4857899 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -188,6 +188,21 @@ config SPLIT_PTLOCK_CPUS
>  	default "4"
>  
>  #
> +# support for memory balloon compaction
> +config BALLOON_COMPACTION
> +	bool "Allow for balloon memory compaction/migration"
> +	select COMPACTION
> +	depends on VIRTIO_BALLOON
> +	help
> +	  Memory fragmentation introduced by ballooning might significantly
> +	  reduce the number of 2MB contiguous memory blocks that can be
> +	  used within a guest, thus imposing performance penalties associated
> +	  with the reduced number of transparent huge pages that could be used
> +	  by the guest workload. Allowing the compaction & migration for memory
> +	  pages enlisted as being part of memory balloon devices avoids the
> +	  aforementioned scenario and helps improve memory defragmentation.
> +
> +#
>  # support for memory compaction
>  config COMPACTION
>  	bool "Allow for memory compaction"
> diff --git a/mm/Makefile b/mm/Makefile
> index 92753e2..78d8caa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> -			   compaction.o $(mmu-y)
> +			   compaction.o balloon_compaction.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  
> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
> new file mode 100644
> index 0000000..d79f13d
> --- /dev/null
> +++ b/mm/balloon_compaction.c
> @@ -0,0 +1,145 @@
> +/*
> + * mm/balloon_compaction.c
> + *
> + * Common interface for making balloon pages movable to compaction.
> + *
> + * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini <aquini@redhat.com>
> + */
> +#include <linux/slab.h>
> +#include <linux/export.h>
> +#include <linux/balloon_compaction.h>
> +
> +/*
> + * alloc_balloon_mapping - allocates a special ->mapping for ballooned pages.
> + * @balloon_device: pointer address that references the balloon device which
> + *                 owns pages bearing this ->mapping.
> + * @a_ops: balloon_mapping address_space_operations descriptor.
> + *
> + * Users must call it to properly allocate and initialize an instance of
> + * struct address_space which will be used as the special page->mapping for
> + * balloon devices enlisted page instances.
> + */
> +struct address_space *alloc_balloon_mapping(void *balloon_device,
> +				const struct address_space_operations *a_ops)
> +{
> +	struct address_space *mapping;
> +
> +	mapping = kmalloc(sizeof(*mapping), GFP_KERNEL);
> +	if (mapping) {

If !mapping return NULL will make code shorter.
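
Roughly (untested sketch, reusing the helpers from this patch):

struct address_space *alloc_balloon_mapping(void *balloon_device,
				const struct address_space_operations *a_ops)
{
	struct address_space *mapping;

	mapping = kmalloc(sizeof(*mapping), GFP_KERNEL);
	if (!mapping)
		return NULL;

	/* same initialization as in the hunk below, one level less indented */
	address_space_init_once(mapping);
	mapping_set_balloon(mapping);
	mapping_set_gfp_mask(mapping, balloon_mapping_gfp_mask());
	mapping->a_ops = a_ops;
	mapping->assoc_mapping = balloon_device;

	return mapping;
}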

> +		/*
> +		 * Give a clean 'zeroed' status to all elements of this special
> +		 * balloon page->mapping struct address_space instance.
> +		 */
> +		address_space_init_once(mapping);
> +
> +		/*
> +		 * Set mapping->flags appropriately, to allow balloon ->mapping
> +		 * identification, as well as give a proper hint to the balloon
> +		 * driver on what GFP allocation mask shall be used.
> +		 */
> +		mapping_set_balloon(mapping);
> +		mapping_set_gfp_mask(mapping, balloon_mapping_gfp_mask());
> +
> +		/* balloon's page->mapping->a_ops callback descriptor */
> +		mapping->a_ops = a_ops;
> +
> +		/*
> +		 * balloon special page->mapping overloads ->assoc_mapping
> +		 * to held a reference back to the balloon device wich 'owns'
> +		 * a given page. This is the way we can cope with multiple
> +		 * balloon devices without losing reference of several
> +		 * ballooned pagesets.
> +		 */
> +		mapping->assoc_mapping = (void *)balloon_device;

(void *) is only a good idea if you are casting a long value.

> +
> +		return mapping;
> +	}
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(alloc_balloon_mapping);
> +
> +#if defined(CONFIG_BALLOON_COMPACTION)
> +static inline int __isolate_balloon_page(struct page *page)
> +{
> +	struct address_space *mapping;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +	mapping = rcu_dereference(page->mapping);
> +	if (mapping)
> +		ret = mapping->a_ops->launder_page(page);
> +
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +static inline void __putback_balloon_page(struct page *page)
> +{
> +	struct address_space *mapping;
> +
> +	rcu_read_lock();
> +	mapping = rcu_dereference(page->mapping);
> +	/*
> +	 * If we stumble across a page->mapping NULL here, so something has
> +	 * messed with the private isolated pageset we are iterating over.
> +	 */
> +	BUG_ON(!mapping);
> +
> +	mapping->a_ops->freepage(page);
> +	/* isolation bumps up the page refcount, time to decrement it */
> +	put_page(page);
> +	rcu_read_unlock();
> +}
> +
> +/* __isolate_lru_page() counterpart for a ballooned page */
> +bool isolate_balloon_page(struct page *page)
> +{
> +	if (likely(get_page_unless_zero(page))) {
> +		/*
> +		 * As balloon pages are not isolated from LRU lists, concurrent
> +		 * compaction threads can race against page migration functions
> +		 * move_to_new_page() & __unmap_and_move().
> +		 * In order to avoid having an already isolated balloon page
> +		 * being (wrongly) re-isolated while it is under migration,
> +		 * lets be sure we have the page lock before proceeding with
> +		 * the balloon page isolation steps.
> +		 */
> +		if (likely(trylock_page(page))) {
> +			/*
> +			 * A ballooned page, by default, has just one refcount.
> +			 * Prevent concurrent compaction threads from isolating
> +			 * an already isolated balloon page by refcount check.
> +			 */
> +			if (movable_balloon_page(page) &&
> +			    (page_count(page) == 2)) {
> +				if (__isolate_balloon_page(page)) {
> +					unlock_page(page);
> +					return true;
> +				}
> +			}
> +			unlock_page(page);
> +		}
> +		/*
> +		 * This page is either under migration, it is isolated already,
> +		 * or its isolation step has not happened at this round.
> +		 * Drop the refcount taken for it.
> +		 */
> +		put_page(page);
> +	}
> +	return false;
> +}
> +
> +/* putback_lru_page() counterpart for a ballooned page */
> +void putback_balloon_page(struct page *page)
> +{
> +	/*
> +	 * 'lock_page()' stabilizes the page and prevents races against
> +	 * concurrent isolation threads attempting to re-isolate it.
> +	 */
> +	lock_page(page);
> +	if (movable_balloon_page(page))
> +		__putback_balloon_page(page);
> +
> +	unlock_page(page);
> +}
> +#endif /* CONFIG_BALLOON_COMPACTION */
> -- 
> 1.7.11.4


* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 13:52   ` Michael S. Tsirkin
@ 2012-08-21 14:25     ` Michael S. Tsirkin
  2012-08-21 15:16     ` Peter Zijlstra
  2012-08-21 17:55     ` Rafael Aquini
  2 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 14:25 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 04:52:23PM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 09:47:44AM -0300, Rafael Aquini wrote:
> > Memory fragmentation introduced by ballooning might reduce significantly
> > the number of 2MB contiguous memory blocks that can be used within a guest,
> > thus imposing performance penalties associated with the reduced number of
> > transparent huge pages that could be used by the guest workload.
> > 
> > This patch introduces a common interface to help a balloon driver on
> > making its page set movable to compaction, and thus allowing the system
> > to better leverage the compation efforts on memory defragmentation.
> > 
> > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> > ---
> >  include/linux/balloon_compaction.h | 113 +++++++++++++++++++++++++++++
> >  include/linux/pagemap.h            |  18 +++++
> >  mm/Kconfig                         |  15 ++++
> >  mm/Makefile                        |   2 +-
> >  mm/balloon_compaction.c            | 145 +++++++++++++++++++++++++++++++++++++
> >  5 files changed, 292 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/balloon_compaction.h
> >  create mode 100644 mm/balloon_compaction.c
> > 
> > diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
> > new file mode 100644
> > index 0000000..5fbf036
> > --- /dev/null
> > +++ b/include/linux/balloon_compaction.h
> > @@ -0,0 +1,113 @@
> > +/*
> > + * include/linux/balloon_compaction.h
> > + *
> > + * Common interface definitions for making balloon pages movable to compaction.
> > + *
> > + * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini <aquini@redhat.com>
> > + */
> > +#ifndef _LINUX_BALLOON_COMPACTION_H
> > +#define _LINUX_BALLOON_COMPACTION_H
> > +#ifdef __KERNEL__
> > +
> > +#include <linux/rcupdate.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/gfp.h>
> > +
> > +#if defined(CONFIG_BALLOON_COMPACTION)
> > +extern bool isolate_balloon_page(struct page *);
> > +extern void putback_balloon_page(struct page *);
> > +
> > +static inline bool movable_balloon_page(struct page *page)
> > +{
> > +	struct address_space *mapping = NULL;
> > +	bool ret = false;
> > +	/*
> > +	 * Before dereferencing and testing mapping->flags, lets make sure
> > +	 * this is not a page that uses ->mapping in a different way
> > +	 */
> > +	if (!PageSlab(page) && !PageSwapCache(page) &&
> > +	    !PageAnon(page) && !page_mapped(page)) {
> > +		rcu_read_lock();
> > +		mapping = rcu_dereference(page->mapping);
> > +		if (mapping_balloon(mapping))
> > +			ret = true;
> > +		rcu_read_unlock();
> 
> This looks suspicious: you drop rcu_read_unlock
> so can't page switch from balloon to non balloon?
> 
> Even if correct, it's probably cleaner to just make caller
> invoke this within an rcu critical section.
> You will notice that all callers immediately re-enter
> rcu critical section anyway.
> 
> Alternatively, I noted that accesses to page->mapping
> seem protected by page lock bit.
> If true we can rely on that instead of RCU, just need
> assign_balloon_mapping to lock_page/unlock_page.

Actually not true in all cases: not in the case of putback. Not sure
whether the code can be changed to make it true.
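
If we keep RCU, pushing the read side out to the callers could look
roughly like this (sketch only; callers would wrap the test plus the
->launder_page/->freepage call in a single rcu_read_lock() section):

/* caller must hold rcu_read_lock() */
static inline bool movable_balloon_page(struct page *page)
{
	struct address_space *mapping;

	if (PageSlab(page) || PageSwapCache(page) ||
	    PageAnon(page) || page_mapped(page))
		return false;

	mapping = rcu_dereference(page->mapping);
	return mapping_balloon(mapping);
}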


> > +	}
> > +	return ret;
> > +}
> > +
> > +static inline gfp_t balloon_mapping_gfp_mask(void)
> > +{
> > +	return GFP_HIGHUSER_MOVABLE;
> > +}
> > +
> > +/*
> > + * __page_balloon_device - return the balloon device owing the page.
> > + *
> > + * This shall only be used at driver callbacks under proper page_lock,
> > + * to get access to the balloon device structure that owns @page.
> > + */
> > +static inline void *__page_balloon_device(struct page *page)
> > +{
> > +	struct address_space *mapping;
> > +	mapping = rcu_access_pointer(page->mapping);
> > +	if (mapping)
> > +		mapping = mapping->assoc_mapping;
> > +	return (void *)mapping;
> > +}
> > +
> > +#define count_balloon_event(e) count_vm_event(e)
> > +/*
> > + * DEFINE_BALLOON_MAPPING_AOPS - declare and instantiate a callback descriptor
> > + *				 to be used as balloon page->mapping->a_ops.
> > + *
> > + * @label     : declaration identifier (var name)
> > + * @migratepg : callback symbol name for performing the page migration step
> > + * @isolatepg : callback symbol name for performing the page isolation step
> > + * @putbackpg : callback symbol name for performing the page putback step
> > + *
> > + * address_space_operations utilized methods for ballooned pages:
> > + *   .migratepage    - used to perform balloon's page migration (as is)
> > + *   .launder_page   - used to isolate a page from balloon's page list
> > + *   .freepage       - used to reinsert an isolated page to balloon's page list
> > + */
> 
> It would be a good idea to document the assumptions here.
> Looks like .launder_page and .freepage are called in rcu critical
> section.
> But migratepage isn't - why is that safe?
> 
> > +#define DEFINE_BALLOON_MAPPING_AOPS(label, migratepg, isolatepg, putbackpg) \
> > +	const struct address_space_operations (label) = {		    \
> > +		.migratepage  = (migratepg),				    \
> > +		.launder_page = (isolatepg),				    \
> > +		.freepage     = (putbackpg),				    \
> > +	}
> > +
> > +#else
> > +static inline bool isolate_balloon_page(struct page *page) { return false; }
> > +static inline bool movable_balloon_page(struct page *page) { return false; }
> > +static inline void putback_balloon_page(struct page *page) { return; }
> > +
> > +static inline gfp_t balloon_mapping_gfp_mask(void)
> > +{
> > +	return GFP_HIGHUSER;
> > +}
> > +
> > +#define count_balloon_event(e) {}
> > +#define DEFINE_BALLOON_MAPPING_AOPS(label, migratepg, isolatepg, putbackpg) \
> > +	const struct address_space_operations *(label) = NULL
> > +
> > +#endif /* CONFIG_BALLOON_COMPACTION */
> > +
> > +extern struct address_space *alloc_balloon_mapping(void *balloon_device,
> > +				const struct address_space_operations *a_ops);
> > +extern void *page_balloon_device(struct page *page);
> > +
> > +static inline void assign_balloon_mapping(struct page *page,
> > +					  struct address_space *mapping)
> > +{
> > +	rcu_assign_pointer(page->mapping, mapping);
> > +}
> > +
> > +static inline void clear_balloon_mapping(struct page *page)
> > +{
> > +	rcu_assign_pointer(page->mapping, NULL);
> > +}
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* _LINUX_BALLOON_COMPACTION_H */
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index e42c762..6df0664 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -24,6 +24,7 @@ enum mapping_flags {
> >  	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
> >  	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
> >  	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
> > +	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
> >  };
> >  
> >  static inline void mapping_set_error(struct address_space *mapping, int error)
> > @@ -53,6 +54,23 @@ static inline int mapping_unevictable(struct address_space *mapping)
> >  	return !!mapping;
> >  }
> >  
> > +static inline void mapping_set_balloon(struct address_space *mapping)
> > +{
> > +	set_bit(AS_BALLOON_MAP, &mapping->flags);
> > +}
> > +
> > +static inline void mapping_clear_balloon(struct address_space *mapping)
> > +{
> > +	clear_bit(AS_BALLOON_MAP, &mapping->flags);
> > +}
> > +
> > +static inline int mapping_balloon(struct address_space *mapping)
> > +{
> > +	if (mapping)
> > +		return test_bit(AS_BALLOON_MAP, &mapping->flags);
> > +	return !!mapping;
> > +}
> > +
> >  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> >  {
> >  	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index d5c8019..4857899 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -188,6 +188,21 @@ config SPLIT_PTLOCK_CPUS
> >  	default "4"
> >  
> >  #
> > +# support for memory balloon compaction
> > +config BALLOON_COMPACTION
> > +	bool "Allow for balloon memory compaction/migration"
> > +	select COMPACTION
> > +	depends on VIRTIO_BALLOON
> > +	help
> > +	  Memory fragmentation introduced by ballooning might reduce
> > +	  significantly the number of 2MB contiguous memory blocks that can be
> > +	  used within a guest, thus imposing performance penalties associated
> > +	  with the reduced number of transparent huge pages that could be used
> > +	  by the guest workload. Allowing the compaction & migration for memory
> > +	  pages enlisted as being part of memory balloon devices avoids the
> > +	  scenario aforementioned and helps improving memory defragmentation.
> > +
> > +#
> >  # support for memory compaction
> >  config COMPACTION
> >  	bool "Allow for memory compaction"
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 92753e2..78d8caa 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
> >  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
> >  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> >  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> > -			   compaction.o $(mmu-y)
> > +			   compaction.o balloon_compaction.o $(mmu-y)
> >  
> >  obj-y += init-mm.o
> >  
> > diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
> > new file mode 100644
> > index 0000000..d79f13d
> > --- /dev/null
> > +++ b/mm/balloon_compaction.c
> > @@ -0,0 +1,145 @@
> > +/*
> > + * mm/balloon_compaction.c
> > + *
> > + * Common interface for making balloon pages movable to compaction.
> > + *
> > + * Copyright (C) 2012, Red Hat, Inc.  Rafael Aquini <aquini@redhat.com>
> > + */
> > +#include <linux/slab.h>
> > +#include <linux/export.h>
> > +#include <linux/balloon_compaction.h>
> > +
> > +/*
> > + * alloc_balloon_mapping - allocates a special ->mapping for ballooned pages.
> > + * @balloon_device: pointer address that references the balloon device which
> > + *                 owns pages bearing this ->mapping.
> > + * @a_ops: balloon_mapping address_space_operations descriptor.
> > + *
> > + * Users must call it to properly allocate and initialize an instance of
> > + * struct address_space which will be used as the special page->mapping for
> > + * balloon devices enlisted page instances.
> > + */
> > +struct address_space *alloc_balloon_mapping(void *balloon_device,
> > +				const struct address_space_operations *a_ops)
> > +{
> > +	struct address_space *mapping;
> > +
> > +	mapping = kmalloc(sizeof(*mapping), GFP_KERNEL);
> > +	if (mapping) {
> 
> If !mapping return NULL will make code shorter.
> 
> > +		/*
> > +		 * Give a clean 'zeroed' status to all elements of this special
> > +		 * balloon page->mapping struct address_space instance.
> > +		 */
> > +		address_space_init_once(mapping);
> > +
> > +		/*
> > +		 * Set mapping->flags appropriately, to allow balloon ->mapping
> > +		 * identification, as well as give a proper hint to the balloon
> > +		 * driver on what GFP allocation mask shall be used.
> > +		 */
> > +		mapping_set_balloon(mapping);
> > +		mapping_set_gfp_mask(mapping, balloon_mapping_gfp_mask());
> > +
> > +		/* balloon's page->mapping->a_ops callback descriptor */
> > +		mapping->a_ops = a_ops;
> > +
> > +		/*
> > +		 * balloon special page->mapping overloads ->assoc_mapping
> > +		 * to held a reference back to the balloon device wich 'owns'
> > +		 * a given page. This is the way we can cope with multiple
> > +		 * balloon devices without losing reference of several
> > +		 * ballooned pagesets.
> > +		 */
> > +		mapping->assoc_mapping = (void *)balloon_device;
> 
> (void *) is only a good idea if you are casting a long value.
> 
> > +
> > +		return mapping;
> > +	}
> > +	return NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(alloc_balloon_mapping);
> > +
> > +#if defined(CONFIG_BALLOON_COMPACTION)
> > +static inline int __isolate_balloon_page(struct page *page)
> > +{
> > +	struct address_space *mapping;
> > +	int ret = 0;
> > +
> > +	rcu_read_lock();
> > +	mapping = rcu_dereference(page->mapping);
> > +	if (mapping)
> > +		ret = mapping->a_ops->launder_page(page);
> > +
> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> > +
> > +static inline void __putback_balloon_page(struct page *page)
> > +{
> > +	struct address_space *mapping;
> > +
> > +	rcu_read_lock();
> > +	mapping = rcu_dereference(page->mapping);
> > +	/*
> > +	 * If we stumble across a page->mapping NULL here, so something has
> > +	 * messed with the private isolated pageset we are iterating over.
> > +	 */
> > +	BUG_ON(!mapping);
> > +
> > +	mapping->a_ops->freepage(page);
> > +	/* isolation bumps up the page refcount, time to decrement it */
> > +	put_page(page);
> > +	rcu_read_unlock();
> > +}
> > +
> > +/* __isolate_lru_page() counterpart for a ballooned page */
> > +bool isolate_balloon_page(struct page *page)
> > +{
> > +	if (likely(get_page_unless_zero(page))) {
> > +		/*
> > +		 * As balloon pages are not isolated from LRU lists, concurrent
> > +		 * compaction threads can race against page migration functions
> > +		 * move_to_new_page() & __unmap_and_move().
> > +		 * In order to avoid having an already isolated balloon page
> > +		 * being (wrongly) re-isolated while it is under migration,
> > +		 * lets be sure we have the page lock before proceeding with
> > +		 * the balloon page isolation steps.
> > +		 */
> > +		if (likely(trylock_page(page))) {
> > +			/*
> > +			 * A ballooned page, by default, has just one refcount.
> > +			 * Prevent concurrent compaction threads from isolating
> > +			 * an already isolated balloon page by refcount check.
> > +			 */
> > +			if (movable_balloon_page(page) &&
> > +			    (page_count(page) == 2)) {
> > +				if (__isolate_balloon_page(page)) {
> > +					unlock_page(page);
> > +					return true;
> > +				}
> > +			}
> > +			unlock_page(page);
> > +		}
> > +		/*
> > +		 * This page is either under migration, it is isolated already,
> > +		 * or its isolation step has not happened at this round.
> > +		 * Drop the refcount taken for it.
> > +		 */
> > +		put_page(page);
> > +	}
> > +	return false;
> > +}
> > +
> > +/* putback_lru_page() counterpart for a ballooned page */
> > +void putback_balloon_page(struct page *page)
> > +{
> > +	/*
> > +	 * 'lock_page()' stabilizes the page and prevents races against
> > +	 * concurrent isolation threads attempting to re-isolate it.
> > +	 */
> > +	lock_page(page);
> > +	if (movable_balloon_page(page))
> > +		__putback_balloon_page(page);
> > +
> > +	unlock_page(page);
> > +}
> > +#endif /* CONFIG_BALLOON_COMPACTION */
> > -- 
> > 1.7.11.4


* Re: [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages
  2012-08-21 12:47 ` [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages Rafael Aquini
@ 2012-08-21 14:40   ` Michael S. Tsirkin
  2012-08-21 15:34     ` Peter Zijlstra
  2012-08-21 15:37     ` Peter Zijlstra
  2012-08-21 14:57   ` Michael S. Tsirkin
  1 sibling, 2 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 14:40 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 09:47:46AM -0300, Rafael Aquini wrote:
> Memory fragmentation introduced by ballooning might reduce significantly
> the number of 2MB contiguous memory blocks that can be used within a guest,
> thus imposing performance penalties associated with the reduced number of
> transparent huge pages that could be used by the guest workload.
> 
> Besides making balloon pages movable at allocation time and introducing
> the necessary primitives to perform balloon page migration/compaction,
> this patch also introduces the following locking scheme to provide the
> proper synchronization and protection for struct virtio_balloon elements
> against concurrent accesses due to parallel operations introduced by
> memory compaction / page migration.
>  - balloon_lock (mutex) : synchronizes the access demand to elements of
> 			  struct virtio_balloon and its queue operations;
>  - pages_lock (spinlock): special protection to balloon pages list against
> 			  concurrent list handling operations;
>  - virtio_baloon->pages list handling sync by RCU operations;
> 
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c | 210 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 199 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 0908e60..bda7bb0 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -27,6 +27,7 @@
>  #include <linux/delay.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> +#include <linux/balloon_compaction.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -35,6 +36,12 @@
>   */
>  #define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
>  
> +/* flags used to hint compaction procedures about the balloon device status */
> +enum balloon_status_flags {
> +	BALLOON_REMOVAL = 0,	/* balloon device is under removal steps */
> +	BALLOON_OK,		/* balloon device is up and running */
> +};
> +
>  struct virtio_balloon
>  {
>  	struct virtio_device *vdev;
> @@ -46,11 +53,24 @@ struct virtio_balloon
>  	/* The thread servicing the balloon. */
>  	struct task_struct *thread;
>  
> +	/* balloon special page->mapping */
> +	struct address_space *mapping;
> +
> +	/* Synchronize access/update to this struct virtio_balloon elements */
> +	struct mutex balloon_lock;
> +
>  	/* Waiting for host to ack the pages we released. */
>  	wait_queue_head_t acked;
>  
>  	/* Number of balloon pages we've told the Host we're not using. */
>  	unsigned int num_pages;
> +
> +	/* balloon device status flag */
> +	unsigned short balloon_status;
> +
> +	/* Protect 'pages' list against concurrent handling */
> +	spinlock_t pages_lock;
> +
>  	/*
>  	 * The pages we've told the Host we're not using.
>  	 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
> @@ -122,13 +142,17 @@ static void set_page_pfns(u32 pfns[], struct page *page)
>  
>  static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
> +	/* Get the proper GFP alloc mask from vb->mapping flags */
> +	gfp_t vb_gfp_mask = mapping_gfp_mask(vb->mapping);
> +
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
>  
> +	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
>  	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> -		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
> -					__GFP_NOMEMALLOC | __GFP_NOWARN);
> +		struct page *page = alloc_page(vb_gfp_mask | __GFP_NORETRY |
> +					       __GFP_NOWARN | __GFP_NOMEMALLOC);
>  		if (!page) {
>  			if (printk_ratelimit())
>  				dev_printk(KERN_INFO, &vb->vdev->dev,
> @@ -141,7 +165,10 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  		set_page_pfns(vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		totalram_pages--;
> -		list_add(&page->lru, &vb->pages);
> +		spin_lock(&vb->pages_lock);
> +		list_add_rcu(&page->lru, &vb->pages);
> +		assign_balloon_mapping(page, vb->mapping);
> +		spin_unlock(&vb->pages_lock);
>  	}
>  
>  	/* Didn't get any?  Oh well. */
> @@ -149,6 +176,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  		return;
>  
>  	tell_host(vb, vb->inflate_vq);
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
> @@ -169,21 +197,48 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
>  
> +	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
>  	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> -		page = list_first_entry(&vb->pages, struct page, lru);
> -		list_del(&page->lru);
> +		/*
> +		 * We can race against 'virtballoon_isolatepage()' and end up
> +		 * stumbling across a _temporarily_ empty 'pages' list.
> +		 */
> +		spin_lock(&vb->pages_lock);
> +		page = list_first_or_null_rcu(&vb->pages, struct page, lru);

Why is list_first_or_null_rcu called outside
RCU critical section here?

> +		if (!page) {
> +			spin_unlock(&vb->pages_lock);
> +			break;
> +		}
> +		/*
> +		 * It is safe now to drop page->mapping and delete this page
> +		 * from balloon page list, since we are grabbing 'pages_lock'
> +		 * which prevents 'virtballoon_isolatepage()' from acting.
> +		 */
> +		clear_balloon_mapping(page);
> +		list_del_rcu(&page->lru);
> +		spin_unlock(&vb->pages_lock);
>  		set_page_pfns(vb->pfns + vb->num_pfns, page);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> +	/*
> +	 * Syncrhonize RCU grace period and wait for all RCU read critical side
> +	 * sections to finish before proceeding with page release steps.
> +	 * This avoids compaction/migration callback races against balloon
> +	 * device removal steps.
> +	 */
> +	synchronize_rcu();
>  
>  	/*
>  	 * Note that if
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	tell_host(vb, vb->deflate_vq);
> -	release_pages_by_pfn(vb->pfns, vb->num_pfns);
> +	if (vb->num_pfns > 0) {
> +		tell_host(vb, vb->deflate_vq);
> +		release_pages_by_pfn(vb->pfns, vb->num_pfns);
> +	}
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static inline void update_stat(struct virtio_balloon *vb, int idx,
> @@ -239,6 +294,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	struct scatterlist sg;
>  	unsigned int len;
>  
> +	mutex_lock(&vb->balloon_lock);
>  	vb->need_stats_update = 0;
>  	update_balloon_stats(vb);
>  
> @@ -249,6 +305,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
>  		BUG();
>  	virtqueue_kick(vq);
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static void virtballoon_changed(struct virtio_device *vdev)
> @@ -261,22 +318,27 @@ static void virtballoon_changed(struct virtio_device *vdev)
>  static inline s64 towards_target(struct virtio_balloon *vb)
>  {
>  	__le32 v;
> -	s64 target;
> +	s64 target, actual;
>  
> +	mutex_lock(&vb->balloon_lock);
> +	actual = vb->num_pages;
>  	vb->vdev->config->get(vb->vdev,
>  			      offsetof(struct virtio_balloon_config, num_pages),
>  			      &v, sizeof(v));
>  	target = le32_to_cpu(v);
> -	return target - vb->num_pages;
> +	mutex_unlock(&vb->balloon_lock);
> +	return target - actual;
>  }
>  
>  static void update_balloon_size(struct virtio_balloon *vb)
>  {
> -	__le32 actual = cpu_to_le32(vb->num_pages);
> -
> +	__le32 actual;
> +	mutex_lock(&vb->balloon_lock);
> +	actual = cpu_to_le32(vb->num_pages);
>  	vb->vdev->config->set(vb->vdev,
>  			      offsetof(struct virtio_balloon_config, actual),
>  			      &actual, sizeof(actual));
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static int balloon(void *_vballoon)
> @@ -339,9 +401,117 @@ static int init_vqs(struct virtio_balloon *vb)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_BALLOON_COMPACTION
> +/*
> + * Populate balloon_mapping->a_ops callback method to perform the balloon
> + * page migration task.
> + *
> + * After a ballooned page gets isolated by compaction procedures, this is the
> + * function that performs the page migration on behalf of move_to_new_page(),
> + * when the last calls (page)->mapping->a_ops->migratepage.
> + *
> + * Page migration for virtio balloon is done in a simple swap fashion which
> + * follows these two steps:
> + *  1) insert newpage into vb->pages list and update the host about it;
> + *  2) update the host about the removed old page from vb->pages list;
> + */
> +int virtballoon_migratepage(struct address_space *mapping,
> +		struct page *newpage, struct page *page, enum migrate_mode mode)
> +{
> +	struct virtio_balloon *vb = (void *)__page_balloon_device(page);

Don't cast to void *.

> +
> +	/* at this point, besides very unlikely, a NULL *vb is a serious bug */

A bit of a strange comment. Either it's unlikely or a bug :)
Besides, vb is checked for NULL value, not *vb.

> +	BUG_ON(!vb);
> +
> +	/*
> +	 * Skip page migration step if the memory balloon device is under its
> +	 * removal procedure, to avoid racing against module unload.
> +	 *

What kind of race does this fix? Why doesn't leaking all pages at
module removal fix it?
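
By "leaking all pages" I mean having the removal path drain the balloon
before anything gets torn down, along the lines of (sketch):

	/* in the remove path, before freeing vb and its mapping */
	while (vb->num_pages)
		leak_balloon(vb, vb->num_pages);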

> +	 * If there still are isolated (balloon) pages under migration lists,
> +	 * 'virtballoon_putbackpage()' will take care of them properly, before
> +	 * the module unload finishes.
> +	 */
> +	if (vb->balloon_status == BALLOON_REMOVAL)
> +		return -EAGAIN;
> +
> +	mutex_lock(&vb->balloon_lock);
> +
> +	/* balloon's page migration 1st step */
> +	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +	spin_lock(&vb->pages_lock);
> +	list_add_rcu(&newpage->lru, &vb->pages);
> +	spin_unlock(&vb->pages_lock);
> +	set_page_pfns(vb->pfns, newpage);
> +	tell_host(vb, vb->inflate_vq);
> +
> +	/* balloon's page migration 2nd step */
> +	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +	set_page_pfns(vb->pfns, page);
> +	tell_host(vb, vb->deflate_vq);
> +
> +	mutex_unlock(&vb->balloon_lock);
> +	return 0;
> +}
> +
> +/*
> + * Populate balloon_mapping->a_ops callback method to help compaction on
> + * isolating a page from the balloon page list for posterior migration.
> + */
> +int virtballoon_isolatepage(struct page *page)
> +{
> +	struct virtio_balloon *vb = (void *)__page_balloon_device(page);

Cast here looks wrong.

> +	int ret = 0;
> +	/*
> +	 * If we stumble across a NULL *vb here, it means this page has been
> +	 * already released by 'leak_balloon()'.
> +	 *
> +	 * We also skip the page isolation step if the memory balloon device is
> +	 * under its removal procedure, to avoid racing against module unload.
> +	 */

What kind of race do you have in mind here?
Doesn't leaking all pages in module removal address it?

> +	if (vb


How can vb be NULL? Pls document.

> && (vb->balloon_status != BALLOON_REMOVAL)) {

Read of balloon_status needs some kind of memory barrier I think.

> +		spin_lock(&vb->pages_lock);
> +		/*
> +		 * virtballoon_isolatepage() can race against leak_balloon(),
> +		 * and (wrongly) isolate a page that is about to be freed.
> +		 * Test page->mapping under pages_lock to close that window.
> +		 */
> +		if (rcu_access_pointer(page->mapping) == vb->mapping) {
> +			/* It is safe to isolate this page, now */
> +			list_del_rcu(&page->lru);
> +			ret = 1;
> +		}
> +		spin_unlock(&vb->pages_lock);
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Populate balloon_mapping->a_ops callback method to help compaction on
> + * re-inserting a not-migrated isolated page into the balloon page list.
> + */
> +void virtballoon_putbackpage(struct page *page)
> +{
> +	struct virtio_balloon *vb = (void *)__page_balloon_device(page);
> +
> +	/* at this point, besides very unlikely, a NULL *vb is a serious bug */

A bit of a strange comment. Either it's unlikely or a bug :)
Besides, vb is checked for NULL value, not *vb.

> +	BUG_ON(!vb);
> +
> +	spin_lock(&vb->pages_lock);
> +	list_add_rcu(&page->lru, &vb->pages);
> +	spin_unlock(&vb->pages_lock);
> +}
> +#endif /* CONFIG_BALLOON_COMPACTION */
> +
> +/* define the balloon_mapping->a_ops callbacks to allow compaction/migration */
> +static DEFINE_BALLOON_MAPPING_AOPS(virtio_balloon_aops,
> +				   virtballoon_migratepage,
> +				   virtballoon_isolatepage,
> +				   virtballoon_putbackpage);
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> +	struct address_space *vb_mapping;
>  	int err;
>  
>  	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> @@ -351,12 +521,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	}
>  
>  	INIT_LIST_HEAD(&vb->pages);
> +	spin_lock_init(&vb->pages_lock);
> +	mutex_init(&vb->balloon_lock);
> +
>  	vb->num_pages = 0;
>  	init_waitqueue_head(&vb->config_change);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
>  	vb->need_stats_update = 0;
>  
> +	/* Allocate a special page->mapping for this balloon device */

This comment is not helpful, pls remove.

> +	vb_mapping = alloc_balloon_mapping((void *)vb, &virtio_balloon_aops);

Cast to void * is not needed.

> +	if (!vb_mapping) {
> +		err = -ENOMEM;
> +		goto out_free_vb;
> +	}
> +	/* Store the page->mapping reference for this balloon device */
> +	vb->mapping = vb_mapping;

This comment is not helpful, pls remove.

> +
>  	err = init_vqs(vb);
>  	if (err)
>  		goto out_free_vb;
> @@ -367,12 +549,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		goto out_del_vqs;
>  	}
>  
> +	vb->balloon_status = BALLOON_OK;
>  	return 0;
>  
>  out_del_vqs:
>  	vdev->config->del_vqs(vdev);
>  out_free_vb:
>  	kfree(vb);
> +	kfree(vb_mapping);

I think it's better to wrap the free of the vb mapping in a helper too.
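
E.g. a counterpart to alloc_balloon_mapping(), name made up (sketch):

/* mm/balloon_compaction.c */
void free_balloon_mapping(struct address_space *mapping)
{
	kfree(mapping);
}
EXPORT_SYMBOL_GPL(free_balloon_mapping);

The driver error and remove paths would then call
free_balloon_mapping() instead of open-coding the kfree().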

>  out:
>  	return err;
>  }
> @@ -394,8 +578,10 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb = vdev->priv;
>  
> +	vb->balloon_status = BALLOON_REMOVAL;

This needs some kind of barrier.

>  	kthread_stop(vb->thread);
>  	remove_common(vb);
> +	kfree(vb->mapping);
>  	kfree(vb);
>  }
>  
> @@ -409,6 +595,7 @@ static int virtballoon_freeze(struct virtio_device *vdev)
>  	 * function is called.
>  	 */
>  
> +	vb->balloon_status = BALLOON_REMOVAL;

Here module is not going away so why change status?

>  	remove_common(vb);
>  	return 0;
>  }
> @@ -424,6 +611,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
>  
>  	fill_balloon(vb, towards_target(vb));
>  	update_balloon_size(vb);
> +	vb->balloon_status = BALLOON_OK;

Isn't this too late to set status here? We filled the balloon ...
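
I.e. flip the flag before refilling, roughly (sketch):

	vb->balloon_status = BALLOON_OK;
	fill_balloon(vb, towards_target(vb));
	update_balloon_size(vb);
	return 0;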


>  	return 0;
>  }
>  #endif
> -- 
> 1.7.11.4


* Re: [PATCH v8 4/5] mm: introduce putback_movable_pages()
  2012-08-21 12:47 ` [PATCH v8 4/5] mm: introduce putback_movable_pages() Rafael Aquini
@ 2012-08-21 14:42   ` Michael S. Tsirkin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 14:42 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 09:47:47AM -0300, Rafael Aquini wrote:
> The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
> hacks around putback_lru_pages() in order to allow ballooned pages to be
> re-inserted on balloon page list as if a ballooned page was like a LRU page.
> 
> As ballooned pages are not legitimate LRU pages, this patch introduces
> putback_movable_pages() to properly cope with cases where the isolated
> pageset contains ballooned pages and LRU pages, thus fixing the mentioned
> inelegant hack around putback_lru_pages().
> 
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  include/linux/migrate.h |  2 ++
>  mm/compaction.c         |  4 ++--
>  mm/migrate.c            | 20 ++++++++++++++++++++
>  mm/page_alloc.c         |  2 +-
>  4 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index ce7e667..ff103a1 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -10,6 +10,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
>  #ifdef CONFIG_MIGRATION
>  
>  extern void putback_lru_pages(struct list_head *l);
> +extern void putback_movable_pages(struct list_head *l);
>  extern int migrate_page(struct address_space *,
>  			struct page *, struct page *, enum migrate_mode);
>  extern int migrate_pages(struct list_head *l, new_page_t x,
> @@ -33,6 +34,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
>  #else
>  
>  static inline void putback_lru_pages(struct list_head *l) {}
> +static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t x,
>  		unsigned long private, bool offlining,
>  		enum migrate_mode mode) { return -ENOSYS; }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index ce43dc2..782ed32 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -759,9 +759,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  		trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
>  						nr_remaining);
>  
> -		/* Release LRU pages not migrated */
> +		/* Release isolated pages not migrated */
>  		if (err) {
> -			putback_lru_pages(&cc->migratepages);
> +			putback_movable_pages(&cc->migratepages);
>  			cc->nr_migratepages = 0;
>  			if (err == -ENOMEM) {
>  				ret = COMPACT_PARTIAL;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6392da258..0bf2caf 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -80,6 +80,26 @@ void putback_lru_pages(struct list_head *l)
>  		list_del(&page->lru);
>  		dec_zone_page_state(page, NR_ISOLATED_ANON +
>  				page_is_file_cache(page));
> +			putback_lru_page(page);
> +	}
> +}
> +
> +/*
> + * Put previously isolated pages back onto the appropriated lists

Do you mean appropriate lists?

> + * from where they were once taken off for compaction/migration.
> + *
> + * This function shall be used instead of putback_lru_pages(),
> + * whenever the isolated pageset has been built by isolate_migratepages_range()
> + */
> +void putback_movable_pages(struct list_head *l)
> +{
> +	struct page *page;
> +	struct page *page2;
> +
> +	list_for_each_entry_safe(page, page2, l, lru) {
> +		list_del(&page->lru);
> +		dec_zone_page_state(page, NR_ISOLATED_ANON +
> +				page_is_file_cache(page));
>  		if (unlikely(movable_balloon_page(page)))
>  			putback_balloon_page(page);
>  		else
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 009ac28..78b7663 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5669,7 +5669,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
>  				    0, false, MIGRATE_SYNC);
>  	}
>  
> -	putback_lru_pages(&cc.migratepages);
> +	putback_movable_pages(&cc.migratepages);
>  	return ret > 0 ? 0 : ret;
>  }
>  
> -- 
> 1.7.11.4


* Re: [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages
  2012-08-21 12:47 ` [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages Rafael Aquini
  2012-08-21 14:40   ` Michael S. Tsirkin
@ 2012-08-21 14:57   ` Michael S. Tsirkin
  1 sibling, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 14:57 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 09:47:46AM -0300, Rafael Aquini wrote:
> Memory fragmentation introduced by ballooning might reduce significantly
> the number of 2MB contiguous memory blocks that can be used within a guest,
> thus imposing performance penalties associated with the reduced number of
> transparent huge pages that could be used by the guest workload.
> 
> Besides making balloon pages movable at allocation time and introducing
> the necessary primitives to perform balloon page migration/compaction,
> this patch also introduces the following locking scheme to provide the
> proper synchronization and protection for struct virtio_balloon elements
> against concurrent accesses due to parallel operations introduced by
> memory compaction / page migration.
>  - balloon_lock (mutex) : synchronizes the access demand to elements of
> 			  struct virtio_balloon and its queue operations;
>  - pages_lock (spinlock): special protection to balloon pages list against
> 			  concurrent list handling operations;
>  - virtio_baloon->pages list handling sync by RCU operations;
> 
> Signed-off-by: Rafael Aquini <aquini@redhat.com>

Sorry about sending multiple comments on the same patch.
I just thought of some more questions.

> ---
>  drivers/virtio/virtio_balloon.c | 210 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 199 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 0908e60..bda7bb0 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -27,6 +27,7 @@
>  #include <linux/delay.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> +#include <linux/balloon_compaction.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -35,6 +36,12 @@
>   */
>  #define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
>  
> +/* flags used to hint compaction procedures about the balloon device status */
> +enum balloon_status_flags {
> +	BALLOON_REMOVAL = 0,	/* balloon device is under removal steps */
> +	BALLOON_OK,		/* balloon device is up and running */
> +};
> +
>  struct virtio_balloon
>  {
>  	struct virtio_device *vdev;
> @@ -46,11 +53,24 @@ struct virtio_balloon
>  	/* The thread servicing the balloon. */
>  	struct task_struct *thread;
>  
> +	/* balloon special page->mapping */
> +	struct address_space *mapping;
> +
> +	/* Synchronize access/update to this struct virtio_balloon elements */
> +	struct mutex balloon_lock;
> +
>  	/* Waiting for host to ack the pages we released. */
>  	wait_queue_head_t acked;
>  
>  	/* Number of balloon pages we've told the Host we're not using. */
>  	unsigned int num_pages;
> +
> +	/* balloon device status flag */
> +	unsigned short balloon_status;
> +
> +	/* Protect 'pages' list against concurrent handling */
> +	spinlock_t pages_lock;
> +
>  	/*
>  	 * The pages we've told the Host we're not using.
>  	 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
> @@ -122,13 +142,17 @@ static void set_page_pfns(u32 pfns[], struct page *page)
>  
>  static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
> +	/* Get the proper GFP alloc mask from vb->mapping flags */
> +	gfp_t vb_gfp_mask = mapping_gfp_mask(vb->mapping);
> +
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
>  
> +	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
>  	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> -		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
> -					__GFP_NOMEMALLOC | __GFP_NOWARN);
> +		struct page *page = alloc_page(vb_gfp_mask | __GFP_NORETRY |
> +					       __GFP_NOWARN | __GFP_NOMEMALLOC);
>  		if (!page) {
>  			if (printk_ratelimit())
>  				dev_printk(KERN_INFO, &vb->vdev->dev,
> @@ -141,7 +165,10 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  		set_page_pfns(vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		totalram_pages--;
> -		list_add(&page->lru, &vb->pages);
> +		spin_lock(&vb->pages_lock);
> +		list_add_rcu(&page->lru, &vb->pages);
> +		assign_balloon_mapping(page, vb->mapping);
> +		spin_unlock(&vb->pages_lock);
>  	}
>  
>  	/* Didn't get any?  Oh well. */
> @@ -149,6 +176,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  		return;
>  
>  	tell_host(vb, vb->inflate_vq);
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
> @@ -169,21 +197,48 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
>  
> +	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
>  	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> -		page = list_first_entry(&vb->pages, struct page, lru);
> -		list_del(&page->lru);
> +		/*
> +		 * We can race against 'virtballoon_isolatepage()' and end up
> +		 * stumbling across a _temporarily_ empty 'pages' list.
> +		 */
> +		spin_lock(&vb->pages_lock);
> +		page = list_first_or_null_rcu(&vb->pages, struct page, lru);
> +		if (!page) {
> +			spin_unlock(&vb->pages_lock);
> +			break;

Now what happens? Putback will later return those isolated pages to the
list, but you have already exited the loop.

> +		}
> +		/*
> +		 * It is safe now to drop page->mapping and delete this page
> +		 * from balloon page list, since we are grabbing 'pages_lock'
> +		 * which prevents 'virtballoon_isolatepage()' from acting.
> +		 */
> +		clear_balloon_mapping(page);
> +		list_del_rcu(&page->lru);
> +		spin_unlock(&vb->pages_lock);
>  		set_page_pfns(vb->pfns + vb->num_pfns, page);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> +	/*
> +	 * Syncrhonize

Typo.

> RCU grace period and wait for all RCU read critical side
> +	 * sections to finish before proceeding with page release steps.
> +	 * This avoids compaction/migration callback races against balloon
> +	 * device removal steps.
> +	 */
> +	synchronize_rcu();
>  
>  	/*
>  	 * Note that if
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	tell_host(vb, vb->deflate_vq);
> -	release_pages_by_pfn(vb->pfns, vb->num_pfns);
> +	if (vb->num_pfns > 0) {
> +		tell_host(vb, vb->deflate_vq);
> +		release_pages_by_pfn(vb->pfns, vb->num_pfns);
> +	}
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static inline void update_stat(struct virtio_balloon *vb, int idx,
> @@ -239,6 +294,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	struct scatterlist sg;
>  	unsigned int len;
>  
> +	mutex_lock(&vb->balloon_lock);
>  	vb->need_stats_update = 0;
>  	update_balloon_stats(vb);
>  
> @@ -249,6 +305,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
>  		BUG();
>  	virtqueue_kick(vq);
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static void virtballoon_changed(struct virtio_device *vdev)
> @@ -261,22 +318,27 @@ static void virtballoon_changed(struct virtio_device *vdev)
>  static inline s64 towards_target(struct virtio_balloon *vb)
>  {
>  	__le32 v;
> -	s64 target;
> +	s64 target, actual;
>  
> +	mutex_lock(&vb->balloon_lock);
> +	actual = vb->num_pages;
>  	vb->vdev->config->get(vb->vdev,
>  			      offsetof(struct virtio_balloon_config, num_pages),
>  			      &v, sizeof(v));
>  	target = le32_to_cpu(v);
> -	return target - vb->num_pages;
> +	mutex_unlock(&vb->balloon_lock);
> +	return target - actual;
>  }
>  
>  static void update_balloon_size(struct virtio_balloon *vb)
>  {
> -	__le32 actual = cpu_to_le32(vb->num_pages);
> -
> +	__le32 actual;
> +	mutex_lock(&vb->balloon_lock);
> +	actual = cpu_to_le32(vb->num_pages);
>  	vb->vdev->config->set(vb->vdev,
>  			      offsetof(struct virtio_balloon_config, actual),
>  			      &actual, sizeof(actual));
> +	mutex_unlock(&vb->balloon_lock);
>  }
>  
>  static int balloon(void *_vballoon)
> @@ -339,9 +401,117 @@ static int init_vqs(struct virtio_balloon *vb)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_BALLOON_COMPACTION
> +/*
> + * Populate balloon_mapping->a_ops callback method to perform the balloon
> + * page migration task.
> + *
> + * After a ballooned page gets isolated by compaction procedures, this is the
> + * function that performs the page migration on behalf of move_to_new_page(),
> + * when the last calls (page)->mapping->a_ops->migratepage.
> + *
> + * Page migration for virtio balloon is done in a simple swap fashion which
> + * follows these two steps:
> + *  1) insert newpage into vb->pages list and update the host about it;
> + *  2) update the host about the removed old page from vb->pages list;
> + */
> +int virtballoon_migratepage(struct address_space *mapping,
> +		struct page *newpage, struct page *page, enum migrate_mode mode)
> +{
> +	struct virtio_balloon *vb = (void *)__page_balloon_device(page);
> +
> +	/* at this point, besides very unlikely, a NULL *vb is a serious bug */
> +	BUG_ON(!vb);
> +
> +	/*
> +	 * Skip page migration step if the memory balloon device is under its
> +	 * removal procedure, to avoid racing against module unload.
> +	 *
> +	 * If there still are isolated (balloon) pages under migration lists,
> +	 * 'virtballoon_putbackpage()' will take care of them properly, before
> +	 * the module unload finishes.
> +	 */
> +	if (vb->balloon_status == BALLOON_REMOVAL)
> +		return -EAGAIN;
> +
> +	mutex_lock(&vb->balloon_lock);
> +
> +	/* balloon's page migration 1st step */
> +	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +	spin_lock(&vb->pages_lock);
> +	list_add_rcu(&newpage->lru, &vb->pages);
> +	spin_unlock(&vb->pages_lock);
> +	set_page_pfns(vb->pfns, newpage);
> +	tell_host(vb, vb->inflate_vq);
> +
> +	/* balloon's page migration 2nd step */
> +	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +	set_page_pfns(vb->pfns, page);
> +	tell_host(vb, vb->deflate_vq);
> +
> +	mutex_unlock(&vb->balloon_lock);
> +	return 0;
> +}
> +
> +/*
> + * Populate balloon_mapping->a_ops callback method to help compaction on
> + * isolating a page from the balloon page list for posterior migration.

What does posterior migration mean?

> + */
> +int virtballoon_isolatepage(struct page *page)
> +{

Nothing at all in this function seems to be virtio balloon
specific. Can it move to mm core?
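
If the core kept a small per-device descriptor, the callback could be
generic; something like this, with made-up names (sketch):

/* hypothetical descriptor shared by mm core and balloon drivers */
struct balloon_dev_info {
	struct address_space *mapping;	/* the special page->mapping */
	spinlock_t pages_lock;		/* protects 'pages' */
	struct list_head pages;		/* pages enlisted by the balloon */
};

int balloon_page_isolate(struct balloon_dev_info *info, struct page *page)
{
	int ret = 0;

	spin_lock(&info->pages_lock);
	if (rcu_access_pointer(page->mapping) == info->mapping) {
		list_del_rcu(&page->lru);
		ret = 1;
	}
	spin_unlock(&info->pages_lock);
	return ret;
}

The driver callback would then shrink to the device-status check plus a
call into the generic helper.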

> +	struct virtio_balloon *vb = (void *)__page_balloon_device(page);
> +	int ret = 0;
> +	/*
> +	 * If we stumble across a NULL *vb here, it means this page has been
> +	 * already released by 'leak_balloon()'.
> +	 *
> +	 * We also skip the page isolation step if the memory balloon device is
> +	 * under its removal procedure, to avoid racing against module unload.
> +	 */
> +	if (vb && (vb->balloon_status != BALLOON_REMOVAL)) {
> +		spin_lock(&vb->pages_lock);
> +		/*
> +		 * virtballoon_isolatepage() can race against leak_balloon(),
> +		 * and (wrongly) isolate a page that is about to be freed.
> +		 * Test page->mapping under pages_lock to close that window.
> +		 */

Looks strange. If so, could vb have gone away completely?
And if so, wouldn't this trigger a use-after-free?
Can we take some lock when leaking the page to prevent this, instead?
page_lock comes to mind ...

> +		if (rcu_access_pointer(page->mapping) == vb->mapping) {
> +			/* It is safe to isolate this page, now */
> +			list_del_rcu(&page->lru);
> +			ret = 1;
> +		}
> +		spin_unlock(&vb->pages_lock);
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Populate balloon_mapping->a_ops callback method to help compaction on
> + * re-inserting a not-migrated isolated page into the balloon page list.
> + */
> +void virtballoon_putbackpage(struct page *page)
> +{
> +	struct virtio_balloon *vb = (void *)__page_balloon_device(page);
> +
> +	/* at this point, besides very unlikely, a NULL *vb is a serious bug */
> +	BUG_ON(!vb);
> +
> +	spin_lock(&vb->pages_lock);
> +	list_add_rcu(&page->lru, &vb->pages);
> +	spin_unlock(&vb->pages_lock);
> +}
> +#endif /* CONFIG_BALLOON_COMPACTION */
> +
> +/* define the balloon_mapping->a_ops callbacks to allow compaction/migration */
> +static DEFINE_BALLOON_MAPPING_AOPS(virtio_balloon_aops,
> +				   virtballoon_migratepage,
> +				   virtballoon_isolatepage,
> +				   virtballoon_putbackpage);
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> +	struct address_space *vb_mapping;
>  	int err;
>  
>  	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> @@ -351,12 +521,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	}
>  
>  	INIT_LIST_HEAD(&vb->pages);
> +	spin_lock_init(&vb->pages_lock);
> +	mutex_init(&vb->balloon_lock);
> +
>  	vb->num_pages = 0;
>  	init_waitqueue_head(&vb->config_change);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
>  	vb->need_stats_update = 0;
>  
> +	/* Allocate a special page->mapping for this balloon device */
> +	vb_mapping = alloc_balloon_mapping((void *)vb, &virtio_balloon_aops);
> +	if (!vb_mapping) {
> +		err = -ENOMEM;
> +		goto out_free_vb;
> +	}
> +	/* Store the page->mapping reference for this balloon device */
> +	vb->mapping = vb_mapping;
> +
>  	err = init_vqs(vb);
>  	if (err)
>  		goto out_free_vb;
> @@ -367,12 +549,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		goto out_del_vqs;
>  	}
>  
> +	vb->balloon_status = BALLOON_OK;
>  	return 0;
>  
>  out_del_vqs:
>  	vdev->config->del_vqs(vdev);
>  out_free_vb:
>  	kfree(vb);
> +	kfree(vb_mapping);
>  out:
>  	return err;
>  }
> @@ -394,8 +578,10 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb = vdev->priv;
>  
> +	vb->balloon_status = BALLOON_REMOVAL;
>  	kthread_stop(vb->thread);
>  	remove_common(vb);
> +	kfree(vb->mapping);
>  	kfree(vb);
>  }
>  
> @@ -409,6 +595,7 @@ static int virtballoon_freeze(struct virtio_device *vdev)
>  	 * function is called.
>  	 */
>  
> +	vb->balloon_status = BALLOON_REMOVAL;
>  	remove_common(vb);
>  	return 0;
>  }
> @@ -424,6 +611,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
>  
>  	fill_balloon(vb, towards_target(vb));
>  	update_balloon_size(vb);
> +	vb->balloon_status = BALLOON_OK;
>  	return 0;
>  }
>  #endif
> -- 
> 1.7.11.4


* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 13:52   ` Michael S. Tsirkin
  2012-08-21 14:25     ` Michael S. Tsirkin
@ 2012-08-21 15:16     ` Peter Zijlstra
  2012-08-21 15:41       ` Michael S. Tsirkin
  2012-08-21 17:55     ` Rafael Aquini
  2 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2012-08-21 15:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rafael Aquini, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Rik van Riel, Mel Gorman, Andi Kleen,
	Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, 2012-08-21 at 16:52 +0300, Michael S. Tsirkin wrote:
> > +             rcu_read_lock();
> > +             mapping = rcu_dereference(page->mapping);
> > +             if (mapping_balloon(mapping))
> > +                     ret = true;
> > +             rcu_read_unlock();
> 
> This looks suspicious: you drop rcu_read_unlock
> so can't page switch from balloon to non balloon? 

RCU read lock is a non-exclusive lock, it cannot avoid anything like
that.


* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 12:47 ` [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility Rafael Aquini
  2012-08-21 13:52   ` Michael S. Tsirkin
@ 2012-08-21 15:20   ` Peter Zijlstra
  2012-08-21 16:24     ` Paul E. McKenney
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2012-08-21 15:20 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Michael S. Tsirkin, Rik van Riel, Mel Gorman, Andi Kleen,
	Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim, paulmck

On Tue, 2012-08-21 at 09:47 -0300, Rafael Aquini wrote:
> +       mapping = rcu_access_pointer(page->mapping);
> +       if (mapping)
> +               mapping = mapping->assoc_mapping; 

The comment near rcu_access_pointer() explicitly says:

 * Return the value of the specified RCU-protected pointer, but omit the
 * smp_read_barrier_depends() and keep the ACCESS_ONCE().  This is useful
 * when the value of this pointer is accessed, but the pointer is not
 * dereferenced,

Yet you dereference the pointer... smells like fail to me.
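
Given the comment says this only runs under the page lock,
rcu_dereference_protected() would document that, and complain under
PROVE_RCU when it is violated. Sketch:

static inline void *__page_balloon_device(struct page *page)
{
	struct address_space *mapping;

	mapping = rcu_dereference_protected(page->mapping,
					    PageLocked(page));
	if (mapping)
		mapping = mapping->assoc_mapping;
	return mapping;
}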


* Re: [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages
  2012-08-21 14:40   ` Michael S. Tsirkin
@ 2012-08-21 15:34     ` Peter Zijlstra
  2012-08-21 15:37     ` Peter Zijlstra
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2012-08-21 15:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rafael Aquini, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Rik van Riel, Mel Gorman, Andi Kleen,
	Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, 2012-08-21 at 17:40 +0300, Michael S. Tsirkin wrote:
> > +     vb->balloon_status = BALLOON_REMOVAL;
> 
> This needs some kind of barrier.
> 
> >       kthread_stop(vb->thread); 

kthread_stop() implies an smp_wmb() [ because it needs to do a wakeup ].

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages
  2012-08-21 14:40   ` Michael S. Tsirkin
  2012-08-21 15:34     ` Peter Zijlstra
@ 2012-08-21 15:37     ` Peter Zijlstra
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2012-08-21 15:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rafael Aquini, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Rik van Riel, Mel Gorman, Andi Kleen,
	Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim, paulmck

On Tue, 2012-08-21 at 17:40 +0300, Michael S. Tsirkin wrote:
> > +             spin_lock(&vb->pages_lock);
> > +             page = list_first_or_null_rcu(&vb->pages, struct page, lru);
> 
> Why is list_first_or_null_rcu called outside
> RCU critical section here?

It looks like vb->pages_lock is the exclusive (or modification)
counterpart to the rcu-read-lock in this particular case, so it should
be fine.

Although for that same reason, it seems superfluous to use the RCU list
method since we're exclusive with list manipulations anyway.

> > +             if (!page) {
> > +                     spin_unlock(&vb->pages_lock);
> > +                     break;
> > +             }
> > +             /*
> > +              * It is safe now to drop page->mapping and delete this page
> > +              * from balloon page list, since we are grabbing 'pages_lock'
> > +              * which prevents 'virtballoon_isolatepage()' from acting.
> > +              */
> > +             clear_balloon_mapping(page);
> > +             list_del_rcu(&page->lru);
> > +             spin_unlock(&vb->pages_lock); 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 15:16     ` Peter Zijlstra
@ 2012-08-21 15:41       ` Michael S. Tsirkin
  2012-08-21 17:42         ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael Aquini, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Rik van Riel, Mel Gorman, Andi Kleen,
	Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 05:16:06PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-08-21 at 16:52 +0300, Michael S. Tsirkin wrote:
> > > +             rcu_read_lock();
> > > +             mapping = rcu_dereference(page->mapping);
> > > +             if (mapping_balloon(mapping))
> > > +                     ret = true;
> > > +             rcu_read_unlock();
> > 
> > This looks suspicious: you drop rcu_read_unlock
> > so can't page switch from balloon to non balloon? 
> 
> RCU read lock is a non-exclusive lock, it cannot avoid anything like
> that.

You are right, of course. So even keeping rcu_read_lock across both test
and operation won't be enough - you need to make this function return
the mapping and pass it to isolate_page/putback_page so that it is only
dereferenced once.

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 15:20   ` Peter Zijlstra
@ 2012-08-21 16:24     ` Paul E. McKenney
  2012-08-21 17:28       ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Paul E. McKenney @ 2012-08-21 16:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael Aquini, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Michael S. Tsirkin, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 05:20:11PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-08-21 at 09:47 -0300, Rafael Aquini wrote:
> > +       mapping = rcu_access_pointer(page->mapping);
> > +       if (mapping)
> > +               mapping = mapping->assoc_mapping; 
> 
> The comment near rcu_access_pointer() explicitly says:
> 
>  * Return the value of the specified RCU-protected pointer, but omit the
>  * smp_read_barrier_depends() and keep the ACCESS_ONCE().  This is useful
>  * when the value of this pointer is accessed, but the pointer is not
>  * dereferenced,
> 
> Yet you dereference the pointer... smells like fail to me.

Indeed!

This will break DEC Alpha.  In addition, if ->mapping can transition
from non-NULL to NULL, and if you used rcu_access_pointer() rather
than rcu_dereference() to keep lockdep-RCU from yelling at you about
neither being in an RCU read-side critical section nor holding an
update-side lock, you can see failures as follows:

1.	CPU 0 runs the above code, picks up mapping, and finds it non-NULL.

2.	CPU 0 is preempted or otherwise delayed.  (Keep in mind that
	even disabling interrupts in a guest OS does not prevent the
	host hypervisor from preempting!)

3.	Some other CPU NULLs page->mapping.  Because CPU 0 isn't doing
	anything to prevent it, this other CPU frees the memory.

4.	CPU 0 resumes, and then accesses what is now the freelist.
	Arbitrarily bad things start happening.

If you are in a read-side critical section, use rcu_dereference() instead
of rcu_access_pointer().  If you are holding an update-side lock, use
rcu_dereference_protected() and say what lock you are holding.  If you
are doing something else, please say what it is.
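
To make that concrete, a minimal sketch of the safe patterns (illustrative
only, not taken from the patch; balloon_lock below is just a stand-in for
whatever actually serializes ->mapping updates):

	/* Read side: the pointer is dereferenced, so use rcu_dereference(). */
	rcu_read_lock();
	mapping = rcu_dereference(page->mapping);
	if (mapping && mapping_balloon(mapping))
		ret = true;
	rcu_read_unlock();

	/* Update side: hold the lock that serializes ->mapping updates. */
	spin_lock(&balloon_lock);
	mapping = rcu_dereference_protected(page->mapping,
					    lockdep_is_held(&balloon_lock));
	rcu_assign_pointer(page->mapping, NULL);
	spin_unlock(&balloon_lock);

	/* Only the pointer value is tested, never dereferenced. */
	has_mapping = rcu_access_pointer(page->mapping) != NULL;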

							Thanx, Paul


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 16:24     ` Paul E. McKenney
@ 2012-08-21 17:28       ` Rafael Aquini
  2012-08-21 19:13         ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 17:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Michael S. Tsirkin, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 09:24:32AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 21, 2012 at 05:20:11PM +0200, Peter Zijlstra wrote:
> > On Tue, 2012-08-21 at 09:47 -0300, Rafael Aquini wrote:
> > > +       mapping = rcu_access_pointer(page->mapping);
> > > +       if (mapping)
> > > +               mapping = mapping->assoc_mapping; 
> > 
> > The comment near rcu_access_pointer() explicitly says:
> > 
> >  * Return the value of the specified RCU-protected pointer, but omit the
> >  * smp_read_barrier_depends() and keep the ACCESS_ONCE().  This is useful
> >  * when the value of this pointer is accessed, but the pointer is not
> >  * dereferenced,
> > 
> > Yet you dereference the pointer... smells like fail to me.
> 
> Indeed!
> 
> This will break DEC Alpha.  In addition, if ->mapping can transition
> from non-NULL to NULL, and if you used rcu_access_pointer() rather
> than rcu_dereference() to avoid lockdep-RCU from yelling at you about
> not either being in an RCU read-side critical section or holding an
> update-side lock, you can see failures as follows:
> 
> 1.	CPU 0 runs the above code, picks up mapping, and finds it non-NULL.
> 
> 2.	CPU 0 is preempted or otherwise delayed.  (Keep in mind that
> 	even disabling interrupts in a guest OS does not prevent the
> 	host hypervisor from preempting!)
> 
> 3.	Some other CPU NULLs page->mapping.  Because CPU 0 isn't doing
> 	anything to prevent it, this other CPU frees the memory.
> 
> 4.	CPU 0 resumes, and then accesses what is now the freelist.
> 	Arbitrarily bad things start happening.
> 
> If you are in a read-side critical section, use rcu_dereference() instead
> of rcu_access_pointer().  If you are holding an update-side lock, use
> rcu_dereference_protected() and say what lock you are holding.  If you
> are doing something else, please say what it is.
> 
> 							Thanx, Paul
>
Paul & Peter,

Thanks for looking into this stuff and providing me with such valuable feedback,
and an RCU usage crash course.

I believe rcu_dereference_protected() is what I want/need here, since this code
is always called for pages which we hold locked (PG_locked bit). So, it brings me
to ask you if the following usage looks sane enough to fix the issue you pointed
out, or if it's another misuse of the RCU API:

+       mapping = rcu_dereference_protected(page->mapping, PageLocked(page));
+       if (mapping)
+               mapping = mapping->assoc_mapping; 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 15:41       ` Michael S. Tsirkin
@ 2012-08-21 17:42         ` Rafael Aquini
  2012-08-21 19:28           ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 17:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Zijlstra, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Rik van Riel, Mel Gorman, Andi Kleen,
	Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 06:41:42PM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 05:16:06PM +0200, Peter Zijlstra wrote:
> > On Tue, 2012-08-21 at 16:52 +0300, Michael S. Tsirkin wrote:
> > > > +             rcu_read_lock();
> > > > +             mapping = rcu_dereference(page->mapping);
> > > > +             if (mapping_balloon(mapping))
> > > > +                     ret = true;
> > > > +             rcu_read_unlock();
> > > 
> > > This looks suspicious: you drop rcu_read_unlock
> > > so can't page switch from balloon to non balloon? 
> > 
> > RCU read lock is a non-exclusive lock, it cannot avoid anything like
> > that.
> 
> You are right, of course. So even keeping rcu_read_lock across both test
> and operation won't be enough - you need to make this function return
> the mapping and pass it to isolate_page/putback_page so that it is only
> dereferenced once.
>
No, I need to dereference page->mapping to check the ->mapping flags here, before
returning. Remember this function is used in MM's compaction/migration inner
loops to identify ballooned pages and decide what the next step is. This
function is doing the right thing, IMHO.

Also, looking at how compaction/migration work, we can see that the only critical
path for this function is the page isolation step. The other steps (migration and
putback) perform their work on private lists previously isolated from a given
source.

So, we just need to make sure that the isolation part does not screw things up
by isolating pages the balloon driver is about to release. That's why there are
so many checkpoints down the page isolation path ensuring we really are
isolating a balloon page.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 13:52   ` Michael S. Tsirkin
  2012-08-21 14:25     ` Michael S. Tsirkin
  2012-08-21 15:16     ` Peter Zijlstra
@ 2012-08-21 17:55     ` Rafael Aquini
  2012-08-21 19:16       ` Michael S. Tsirkin
  2 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 17:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 04:52:23PM +0300, Michael S. Tsirkin wrote:
> > + * address_space_operations utilized methods for ballooned pages:
> > + *   .migratepage    - used to perform balloon's page migration (as is)
> > + *   .launder_page   - used to isolate a page from balloon's page list
> > + *   .freepage       - used to reinsert an isolated page to balloon's page list
> > + */
> 
> It would be a good idea to document the assumptions here.
> Looks like .launder_page and .freepage are called in rcu critical
> section.
> But migratepage isn't - why is that safe?
> 

The migratepage callback for virtio_balloon can sleep, and IIUC we cannot sleep
within an RCU critical section.

Also, the migratepage callback is called from the migration inner-loop function
move_to_new_page(), and I don't think embedding it in an RCU critical section
would be a good idea, for the same reason.
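
Just to illustrate the point (a sketch of the problem, not code from the patch):

	rcu_read_lock();
	/*
	 * tell_host() blocks waiting for the host to ack the request, and
	 * blocking is not allowed inside an RCU read-side critical section,
	 * so running the callback like this would be a bug:
	 */
	tell_host(vb, vb->deflate_vq);
	rcu_read_unlock();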


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 17:28       ` Rafael Aquini
@ 2012-08-21 19:13         ` Michael S. Tsirkin
  2012-08-21 19:23           ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 19:13 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 02:28:20PM -0300, Rafael Aquini wrote:
> On Tue, Aug 21, 2012 at 09:24:32AM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 21, 2012 at 05:20:11PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2012-08-21 at 09:47 -0300, Rafael Aquini wrote:
> > > > +       mapping = rcu_access_pointer(page->mapping);
> > > > +       if (mapping)
> > > > +               mapping = mapping->assoc_mapping; 
> > > 
> > > The comment near rcu_access_pointer() explicitly says:
> > > 
> > >  * Return the value of the specified RCU-protected pointer, but omit the
> > >  * smp_read_barrier_depends() and keep the ACCESS_ONCE().  This is useful
> > >  * when the value of this pointer is accessed, but the pointer is not
> > >  * dereferenced,
> > > 
> > > Yet you dereference the pointer... smells like fail to me.
> > 
> > Indeed!
> > 
> > This will break DEC Alpha.  In addition, if ->mapping can transition
> > from non-NULL to NULL, and if you used rcu_access_pointer() rather
> > than rcu_dereference() to avoid lockdep-RCU from yelling at you about
> > not either being in an RCU read-side critical section or holding an
> > update-side lock, you can see failures as follows:
> > 
> > 1.	CPU 0 runs the above code, picks up mapping, and finds it non-NULL.
> > 
> > 2.	CPU 0 is preempted or otherwise delayed.  (Keep in mind that
> > 	even disabling interrupts in a guest OS does not prevent the
> > 	host hypervisor from preempting!)
> > 
> > 3.	Some other CPU NULLs page->mapping.  Because CPU 0 isn't doing
> > 	anything to prevent it, this other CPU frees the memory.
> > 
> > 4.	CPU 0 resumes, and then accesses what is now the freelist.
> > 	Arbitrarily bad things start happening.
> > 
> > If you are in a read-side critical section, use rcu_dereference() instead
> > of rcu_access_pointer().  If you are holding an update-side lock, use
> > rcu_dereference_protected() and say what lock you are holding.  If you
> > are doing something else, please say what it is.
> > 
> > 							Thanx, Paul
> >
> Paul & Peter,
> 
> Thanks for looking into this stuff and providing me such valuable feedback, and
> RCU usage crashcourse.
> 
> I believe rcu_dereference_protected() is what I want/need here, since this code
> is always called for pages which we hold locked (PG_locked bit).

It would only help if we locked the page while updating the mapping;
as far as I can see, we don't.

> So, it brings me
> to ask you if the following usage looks sane enough to fix the well pointed issue,
> or if it's another misuse of RCU API:
> 
> +       mapping = rcu_dereference_protected(page->mapping, PageLocked(page));
> +       if (mapping)
> +               mapping = mapping->assoc_mapping; 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 17:55     ` Rafael Aquini
@ 2012-08-21 19:16       ` Michael S. Tsirkin
  2012-08-21 19:34         ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 19:16 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 02:55:03PM -0300, Rafael Aquini wrote:
> On Tue, Aug 21, 2012 at 04:52:23PM +0300, Michael S. Tsirkin wrote:
> > > + * address_space_operations utilized methods for ballooned pages:
> > > + *   .migratepage    - used to perform balloon's page migration (as is)
> > > + *   .launder_page   - used to isolate a page from balloon's page list
> > > + *   .freepage       - used to reinsert an isolated page to balloon's page list
> > > + */
> > 
> > It would be a good idea to document the assumptions here.
> > Looks like .launder_page and .freepage are called in rcu critical
> > section.
> > But migratepage isn't - why is that safe?
> > 
> 
> The migratepage callback for virtio_balloon can sleep, and IIUC we cannot sleep
> within a RCU critical section. 
> 
> Also, The migratepage callback is called at inner migration's circle function
> move_to_new_page(), and I don't think embedding it in a RCU critical section
> would be a good idea, for the same understanding aforementioned.

Yes, but this means it is still exposed to the module unloading
races that RCU was supposed to fix.
So we need to either rework that code so it won't sleep
or switch to some other synchronization.

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 19:13         ` Michael S. Tsirkin
@ 2012-08-21 19:23           ` Rafael Aquini
  2012-08-21 19:30             ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > 
> > I believe rcu_dereference_protected() is what I want/need here, since this code
> > is always called for pages which we hold locked (PG_locked bit).
> 
> It would only help if we locked the page while updating the mapping,
> as far as I can see we don't.
>

But we can do it. In fact, by doing it (locking the page) we can easily avoid
the nasty race balloon_isolate_page / leak_balloon, in a much simpler way, IMHO.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 17:42         ` Rafael Aquini
@ 2012-08-21 19:28           ` Michael S. Tsirkin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 19:28 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Peter Zijlstra, linux-mm, linux-kernel, virtualization,
	Rusty Russell, Rik van Riel, Mel Gorman, Andi Kleen,
	Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 02:42:52PM -0300, Rafael Aquini wrote:
> On Tue, Aug 21, 2012 at 06:41:42PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 21, 2012 at 05:16:06PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2012-08-21 at 16:52 +0300, Michael S. Tsirkin wrote:
> > > > > +             rcu_read_lock();
> > > > > +             mapping = rcu_dereference(page->mapping);
> > > > > +             if (mapping_balloon(mapping))
> > > > > +                     ret = true;
> > > > > +             rcu_read_unlock();
> > > > 
> > > > This looks suspicious: you drop rcu_read_unlock
> > > > so can't page switch from balloon to non balloon? 
> > > 
> > > RCU read lock is a non-exclusive lock, it cannot avoid anything like
> > > that.
> > 
> > You are right, of course. So even keeping rcu_read_lock across both test
> > and operation won't be enough - you need to make this function return
> > the mapping and pass it to isolate_page/putback_page so that it is only
> > dereferenced once.
> >
> No, I need to dereference page->mapping to check ->mapping flags here, before
> returning. Remember this function is used at MM's compaction/migration inner
> circles to identify ballooned pages and decide what's the next step. This
> function is doing the right thing, IMHO.

Yes, but the calling code is not doing the right thing.

What Peter pointed out here is that two rcu_dereference() calls on the same
pointer can return different values: an RCU critical section is not a lock.
So the test for a balloon page is not effective: it can change
after the fact.

To fix, get the pointer once and then pass the mapping
around.
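
Roughly like this (hand-waved sketch; the helper name below is made up):

	rcu_read_lock();
	mapping = rcu_dereference(page->mapping);	/* dereference exactly once */
	if (mapping && mapping_balloon(mapping))
		/* hand that same mapping to the next step */
		ret = isolate_balloon_page(mapping, page);	/* made-up name */
	rcu_read_unlock();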


> Also, looking at how compaction/migration work, we verify the only critical path
> for this function is the page isolation step. The other steps (migration and
> putback) perform their work on private lists previouly isolated from a given
> source.

I vaguely understand but it would be nice to document this properly.
The interaction between page->lru handling in balloon and in mm
is especially confusing.

> So, we just need to make sure that the isolation part does not screw things up
> by isolating pages that balloon driver is about to release. That's why there are
> so many checkpoints down the page isolation path assuring we really are
> isolating a balloon page. 

Well, testing the same thing multiple times is just confusing.  It is very
hard to make sure there are no races with so much complexity,
and the requirements from the balloon driver are unclear to me -
it very much looks like it is poking at mm internals.

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 19:23           ` Rafael Aquini
@ 2012-08-21 19:30             ` Michael S. Tsirkin
  2012-08-21 20:45               ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-21 19:30 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 04:23:58PM -0300, Rafael Aquini wrote:
> On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > > 
> > > I believe rcu_dereference_protected() is what I want/need here, since this code
> > > is always called for pages which we hold locked (PG_locked bit).
> > 
> > It would only help if we locked the page while updating the mapping,
> > as far as I can see we don't.
> >
> 
> But we can do it. In fact, by doing it (locking the page) we can easily avoid
> the nasty race balloon_isolate_page / leak_balloon, in a much simpler way, IMHO.

Absolutely. Further, we should look hard at whether most RCU uses
in this patchset can be replaced with page lock.

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 19:16       ` Michael S. Tsirkin
@ 2012-08-21 19:34         ` Rafael Aquini
  2012-08-22  0:06           ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 19:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 10:16:12PM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 02:55:03PM -0300, Rafael Aquini wrote:
> > On Tue, Aug 21, 2012 at 04:52:23PM +0300, Michael S. Tsirkin wrote:
> > > > + * address_space_operations utilized methods for ballooned pages:
> > > > + *   .migratepage    - used to perform balloon's page migration (as is)
> > > > + *   .launder_page   - used to isolate a page from balloon's page list
> > > > + *   .freepage       - used to reinsert an isolated page to balloon's page list
> > > > + */
> > > 
> > > It would be a good idea to document the assumptions here.
> > > Looks like .launder_page and .freepage are called in rcu critical
> > > section.
> > > But migratepage isn't - why is that safe?
> > > 
> > 
> > The migratepage callback for virtio_balloon can sleep, and IIUC we cannot sleep
> > within a RCU critical section. 
> > 
> > Also, The migratepage callback is called at inner migration's circle function
> > move_to_new_page(), and I don't think embedding it in a RCU critical section
> > would be a good idea, for the same understanding aforementioned.
> 
> Yes but this means it is still exposed to the module unloading
> races that RCU was supposed to fix.
> So need to either rework that code so it won't sleep
> or switch to some other synchronization.
>
Can you refactor tell_host() to not sleep? Or can I get rid of calling it from
virtballoon_migratepage()? If 'no' is the answer to both questions, that's the
way the code has to remain, even if we find a way to hack the
migratepage callback and have it embedded into an RCU critical section.

That's why I believe that once the balloon driver is commanded to unload, we must
flag virtballoon_migratepage to skip its work. By doing this, the thread
performing memory compaction will have to fall back to the 'putback' path, which is
RCU protected (IMHO).

As the module will not unload until it leaks all pages on its list, the unload
race you pointed out before will be covered.
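
Something along these lines (a minimal sketch, not the actual patch;
balloon_of() is a made-up helper to get from the balloon mapping back to its
device):

static int virtballoon_migratepage(struct address_space *mapping,
				   struct page *newpage, struct page *page,
				   enum migrate_mode mode)
{
	struct virtio_balloon *vb = balloon_of(mapping);	/* made-up helper */

	/* skip the migration work once the driver was told to unload */
	if (vb->balloon_status == BALLOON_REMOVAL)
		return -EAGAIN;	/* compaction then takes the putback path */

	/* the normal page copy + tell_host() work would go here */
	return 0;
}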


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 19:30             ` Michael S. Tsirkin
@ 2012-08-21 20:45               ` Rafael Aquini
  2012-08-22  0:07                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-21 20:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 10:30:31PM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 04:23:58PM -0300, Rafael Aquini wrote:
> > On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > > > 
> > > > I believe rcu_dereference_protected() is what I want/need here, since this code
> > > > is always called for pages which we hold locked (PG_locked bit).
> > > 
> > > It would only help if we locked the page while updating the mapping,
> > > as far as I can see we don't.
> > >
> > 
> > But we can do it. In fact, by doing it (locking the page) we can easily avoid
> > the nasty race balloon_isolate_page / leak_balloon, in a much simpler way, IMHO.
> 
> Absolutely. Further, we should look hard at whether most RCU uses
> in this patchset can be replaced with page lock.
>

Yeah, in fact, by testing/grabbing the page lock at leak_balloon(), even the
module unload vs. migration/putback race seems to fade away, since the migration
code holds the page locked all the way.

And that seems quite an easy task to accomplish:

....
@@ -169,21 +197,61 @@ static void leak_balloon(struct virtio_balloon *vb, size_t
num)
        /* We can only do one array worth at a time. */
        num = min(num, ARRAY_SIZE(vb->pfns));

+       mutex_lock(&vb->balloon_lock);
        for (vb->num_pfns = 0; vb->num_pfns < num;
             vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+               spin_lock(&vb->pages_lock);
+               /*
+                * 'virtballoon_isolatepage()' can drain the vb->pages list,
+                * making us stumble across a _temporarily_ empty list.
+                *
+                * Release the spinlock and resume from here in order to
+                * give page migration a shot to refill vb->pages list.
+                */
+               if (unlikely(list_empty(&vb->pages))) {
+                       spin_unlock(&vb->pages_lock);
+                       break;
+               }
+
                page = list_first_entry(&vb->pages, struct page, lru);
+
+               /*
+                * Grab the page lock to avoid racing against threads isolating
+                * pages from vb->pages list (it's done under page lock).
+                *
+                * Failing to grab the page lock here means this page has been
+                * selected for isolation already.
+                */
+               if (!trylock_page(page)) {
+                       spin_unlock(&vb->pages_lock);
+                       break;
+               }
+
+               clear_balloon_mapping(page);
                list_del(&page->lru);
                set_page_pfns(vb->pfns + vb->num_pfns, page);
                vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+               unlock_page(page);
+               spin_unlock(&vb->pages_lock);
        }

.....

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 19:34         ` Rafael Aquini
@ 2012-08-22  0:06           ` Michael S. Tsirkin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-22  0:06 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, virtualization, Rusty Russell,
	Rik van Riel, Mel Gorman, Andi Kleen, Andrew Morton,
	Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 04:34:39PM -0300, Rafael Aquini wrote:
> On Tue, Aug 21, 2012 at 10:16:12PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 21, 2012 at 02:55:03PM -0300, Rafael Aquini wrote:
> > > On Tue, Aug 21, 2012 at 04:52:23PM +0300, Michael S. Tsirkin wrote:
> > > > > + * address_space_operations utilized methods for ballooned pages:
> > > > > + *   .migratepage    - used to perform balloon's page migration (as is)
> > > > > + *   .launder_page   - used to isolate a page from balloon's page list
> > > > > + *   .freepage       - used to reinsert an isolated page to balloon's page list
> > > > > + */
> > > > 
> > > > It would be a good idea to document the assumptions here.
> > > > Looks like .launder_page and .freepage are called in rcu critical
> > > > section.
> > > > But migratepage isn't - why is that safe?
> > > > 
> > > 
> > > The migratepage callback for virtio_balloon can sleep, and IIUC we cannot sleep
> > > within a RCU critical section. 
> > > 
> > > Also, The migratepage callback is called at inner migration's circle function
> > > move_to_new_page(), and I don't think embedding it in a RCU critical section
> > > would be a good idea, for the same understanding aforementioned.
> > 
> > Yes but this means it is still exposed to the module unloading
> > races that RCU was supposed to fix.
> > So need to either rework that code so it won't sleep
> > or switch to some other synchronization.
> >
> Can you refactor tell_host() to not sleep? Or, can I get rid of calling it at
> virtballoon_migratepage()? If 'no' is the answer for both questions, that's the
> way that code has to remain, even if we find a way around to hack the
> migratepage callback and have it embedded into a RCU crit section.
> 
> That's why I believe once the balloon driver is commanded to unload, we must
> flag virtballoon_migratepage to skip it's work. By doing this, the thread
> performing memory compaction will have to recur to the 'putback' path which is
> RCU protected. (IMHO).
> 
> As the module will not uload utill it leaks all pages on its list, that unload
> race you pointed before will be covered.


It cannot be: nothing the callback does can prevent it from
running after module unload; you must have some synchronization
in the calling code.

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-21 20:45               ` Rafael Aquini
@ 2012-08-22  0:07                 ` Michael S. Tsirkin
  2012-08-22  1:19                   ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-22  0:07 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 05:45:56PM -0300, Rafael Aquini wrote:
> On Tue, Aug 21, 2012 at 10:30:31PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 21, 2012 at 04:23:58PM -0300, Rafael Aquini wrote:
> > > On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > > > > 
> > > > > I believe rcu_dereference_protected() is what I want/need here, since this code
> > > > > is always called for pages which we hold locked (PG_locked bit).
> > > > 
> > > > It would only help if we locked the page while updating the mapping,
> > > > as far as I can see we don't.
> > > >
> > > 
> > > But we can do it. In fact, by doing it (locking the page) we can easily avoid
> > > the nasty race balloon_isolate_page / leak_balloon, in a much simpler way, IMHO.
> > 
> > Absolutely. Further, we should look hard at whether most RCU uses
> > in this patchset can be replaced with page lock.
> >
> 
> Yeah, In fact, by testing/grabbing the page lock at leak_balloon() even the
> module unload X migration / putback race seems to fade away, since migration
> code holds the page locked all the way.
> And that seems a quite easy task to be accomplished:
> 
> ....
> @@ -169,21 +197,61 @@ static void leak_balloon(struct virtio_balloon *vb, size_t
> num)
>         /* We can only do one array worth at a time. */
>         num = min(num, ARRAY_SIZE(vb->pfns));
> 
> +       mutex_lock(&vb->balloon_lock);
>         for (vb->num_pfns = 0; vb->num_pfns < num;
>              vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> +               spin_lock(&vb->pages_lock);
> +               /*
> +                * 'virtballoon_isolatepage()' can drain vb->pages list
> +                * making us to stumble across a _temporarily_ empty list.

This still worries me. If this happens we do not
lock the page, so the module can go away?
If not, we need to document why.

> +                *
> +                * Release the spinlock and resume from here in order to
> +                * give page migration a shot to refill vb->pages list.
> +                */
> +               if (unlikely(list_empty(&vb->pages))) {
> +                       spin_unlock(&vb->pages_lock);
> +                       break;
> +               }
> +
>                 page = list_first_entry(&vb->pages, struct page, lru);
> +
> +               /*
> +                * Grab the page lock to avoid racing against threads isolating
> +                * pages from vb->pages list (it's done under page lock).
> +                *
> +                * Failing to grab the page lock here means this page has been
> +                * selected for isolation already.
> +                */
> +               if (!trylock_page(page)) {
> +                       spin_unlock(&vb->pages_lock);
> +                       break;
> +               }
> +
> +               clear_balloon_mapping(page);
>                 list_del(&page->lru);
>                 set_page_pfns(vb->pfns + vb->num_pfns, page);
>                 vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> +               unlock_page(page);
> +               spin_unlock(&vb->pages_lock);
>         }
> 
> .....

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-22  0:07                 ` Michael S. Tsirkin
@ 2012-08-22  1:19                   ` Rafael Aquini
  2012-08-22  9:33                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-22  1:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Wed, Aug 22, 2012 at 03:07:41AM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 21, 2012 at 05:45:56PM -0300, Rafael Aquini wrote:
> > On Tue, Aug 21, 2012 at 10:30:31PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Aug 21, 2012 at 04:23:58PM -0300, Rafael Aquini wrote:
> > > > On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > I believe rcu_dereference_protected() is what I want/need here, since this code
> > > > > > is always called for pages which we hold locked (PG_locked bit).
> > > > > 
> > > > > It would only help if we locked the page while updating the mapping,
> > > > > as far as I can see we don't.
> > > > >
> > > > 
> > > > But we can do it. In fact, by doing it (locking the page) we can easily avoid
> > > > the nasty race balloon_isolate_page / leak_balloon, in a much simpler way, IMHO.
> > > 
> > > Absolutely. Further, we should look hard at whether most RCU uses
> > > in this patchset can be replaced with page lock.
> > >
> > 
> > Yeah, In fact, by testing/grabbing the page lock at leak_balloon() even the
> > module unload X migration / putback race seems to fade away, since migration
> > code holds the page locked all the way.
> > And that seems a quite easy task to be accomplished:
> > 
> > ....
> > @@ -169,21 +197,61 @@ static void leak_balloon(struct virtio_balloon *vb, size_t
> > num)
> >         /* We can only do one array worth at a time. */
> >         num = min(num, ARRAY_SIZE(vb->pfns));
> > 
> > +       mutex_lock(&vb->balloon_lock);
> >         for (vb->num_pfns = 0; vb->num_pfns < num;
> >              vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > +               spin_lock(&vb->pages_lock);
> > +               /*
> > +                * 'virtballoon_isolatepage()' can drain vb->pages list
> > +                * making us to stumble across a _temporarily_ empty list.
> 
> This still worries me. If this happens we do not
> lock the page so module can go away?
> if not need to document why.
>
The module won't unload unless it leaks all its pages. If we hit that test that
worries you, leak_balloon() will get back to its caller -- remove_common() -- and
it will keep looping at:

        /* There might be pages left in the balloon: free them. */
        while (vb->num_pages)
                leak_balloon(vb, vb->num_pages);

This is true because we do not mess with vb->num_pages while isolating/migrating
balloon pages, so the module will only unload when all isolated pages get back
to the vb->pages list and leak_balloon() reaps them appropriately. As we will be
doing the isolation/migration/putback steps under the page lock, that race is gone.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-22  1:19                   ` Rafael Aquini
@ 2012-08-22  9:33                     ` Michael S. Tsirkin
  2012-08-23  2:19                       ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-22  9:33 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Tue, Aug 21, 2012 at 10:19:31PM -0300, Rafael Aquini wrote:
> On Wed, Aug 22, 2012 at 03:07:41AM +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 21, 2012 at 05:45:56PM -0300, Rafael Aquini wrote:
> > > On Tue, Aug 21, 2012 at 10:30:31PM +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Aug 21, 2012 at 04:23:58PM -0300, Rafael Aquini wrote:
> > > > > On Tue, Aug 21, 2012 at 10:13:30PM +0300, Michael S. Tsirkin wrote:
> > > > > > > 
> > > > > > > I believe rcu_dereference_protected() is what I want/need here, since this code
> > > > > > > is always called for pages which we hold locked (PG_locked bit).
> > > > > > 
> > > > > > It would only help if we locked the page while updating the mapping,
> > > > > > as far as I can see we don't.
> > > > > >
> > > > > 
> > > > > But we can do it. In fact, by doing it (locking the page) we can easily avoid
> > > > > the nasty race balloon_isolate_page / leak_balloon, in a much simpler way, IMHO.
> > > > 
> > > > Absolutely. Further, we should look hard at whether most RCU uses
> > > > in this patchset can be replaced with page lock.
> > > >
> > > 
> > > Yeah, In fact, by testing/grabbing the page lock at leak_balloon() even the
> > > module unload X migration / putback race seems to fade away, since migration
> > > code holds the page locked all the way.
> > > And that seems a quite easy task to be accomplished:
> > > 
> > > ....
> > > @@ -169,21 +197,61 @@ static void leak_balloon(struct virtio_balloon *vb, size_t
> > > num)
> > >         /* We can only do one array worth at a time. */
> > >         num = min(num, ARRAY_SIZE(vb->pfns));
> > > 
> > > +       mutex_lock(&vb->balloon_lock);
> > >         for (vb->num_pfns = 0; vb->num_pfns < num;
> > >              vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > > +               spin_lock(&vb->pages_lock);
> > > +               /*
> > > +                * 'virtballoon_isolatepage()' can drain vb->pages list
> > > +                * making us to stumble across a _temporarily_ empty list.
> > 
> > This still worries me. If this happens we do not
> > lock the page so module can go away?
> > if not need to document why.
> >
> The module won't unload unless it leaks all its pages. If we hit that test that
> worries you, leak_balloon() will get back to its caller -- remove_common(), and
> it will kept looping at:
> 
>         /* There might be pages left in the balloon: free them. */
>         while (vb->num_pages)
>                 leak_balloon(vb, vb->num_pages);
> 
> This is true because we do not mess with vb->num_pages while isolating/migrating
> balloon pages, so the module will only unload when all isolated pages get back
> to vb->pages_list and leak_balloon() reap them appropriatelly. As we will be
> doing isolation/migration/putback steps under 'page lock' that race is gone.


Hmm, so this will busy-wait, which is inelegant.
We need some event IMO.
Also, num_pages is read without a lock here,
which seems wrong.

A similar concern applies to normal leaking
of the balloon: here we might leak less than
required, then wait for the next config change
event.

How about we signal config_change
event when pages are back to pages_list?

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-22  9:33                     ` Michael S. Tsirkin
@ 2012-08-23  2:19                       ` Rafael Aquini
  2012-08-23 10:01                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-23  2:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Wed, Aug 22, 2012 at 12:33:17PM +0300, Michael S. Tsirkin wrote:
> Hmm, so this will busy wait which is unelegant.
> We need some event IMO.

No, it does not busy-wait. leak_balloon() is mutually exclusive with the migration
steps, so when one races against the other we really want
leak_balloon() to drop the mutex temporarily and allow migration to complete its
work of refilling the vb->pages list. Also, leak_balloon() calls tell_host(), which
will potentially make it schedule for each round of vb->pfns that leak_balloon()
releases. So, when remove_common() calls leak_balloon() looping on
vb->num_pages, that won't become a tight loop.
The scheme was apparently working before this series, and it will remain working
after it.


> Also, reading num_pages without a lock here
> which seems wrong.

I'll protect it with the vb->balloon_lock mutex. That will be consistent with the
lock protection scheme this patch is introducing for struct virtio_balloon
elements.


> A similar concern applies to normal leaking
> of the balloon: here we might leak less than
> required, then wait for the next config change
> event.

Just as before, same thing here. If you leaked less than required, balloon()
will keep calling leak_balloon() until the balloon target is reached. This
scheme was working before, and it will keep working after this patch.


> How about we signal config_change
> event when pages are back to pages_list?

I really don't know what to tell you here, but, to me, it seems like an
overcomplication that isn't directly entangled with this patch's purposes.
Besides, you cannot expect compaction/migration to be happening and racing against
leak_balloon() all the time so that they can signal events to the latter, so we might
just be creating a wait-forever condition for leak_balloon(), IMHO.

Cheers!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23  2:19                       ` Rafael Aquini
@ 2012-08-23 10:01                         ` Michael S. Tsirkin
  2012-08-23 12:13                           ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-23 10:01 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Wed, Aug 22, 2012 at 11:19:04PM -0300, Rafael Aquini wrote:
> On Wed, Aug 22, 2012 at 12:33:17PM +0300, Michael S. Tsirkin wrote:
> > Hmm, so this will busy wait which is unelegant.
> > We need some event IMO.
> 
> No, it does not busy wait. leak_balloon() is mutual exclusive with migration
> steps, so for the case we have one racing against the other, we really want
> leak_balloon() dropping the mutex temporarily to allow migration complete its
> work of refilling vb->pages list. Also, leak_balloon() calls tell_host(), which
> will potentially make it to schedule for each round of vb->pfns leak_balloon()
> will release.

tell_host() might not even cause an exit to the host. Even if it does,
it does not involve the guest scheduler.

> So, when remove_common() calls leak_balloon() looping on
> vb->num_pages, that won't become a tight loop. 
> The scheme was apparently working before this series, and it will remain working
> after it.

It seems that before we would always leak all requested memory
in one go. I can't tell why we have a while loop there at all.
Rusty, could you clarify please?

> 
> > Also, reading num_pages without a lock here
> > which seems wrong.
> 
> I'll protect it with vb->balloon_lock mutex. That will be consistent with the
> lock protection scheme this patch is introducing for struct virtio_balloon
> elements.
> 
> 
> > A similar concern applies to normal leaking
> > of the balloon: here we might leak less than
> > required, then wait for the next config change
> > event.
> 
> Just as before, same thing here. If you leaked less than required, balloon()
> will keep calling leak_balloon() until the balloon target is reached. This
> scheme was working before, and it will keep working after this patch.
>

IIUC we never hit this path before.
 
> > How about we signal config_change
> > event when pages are back to pages_list?
> 
> I really don't know what to tell you here, but, to me, it seems like an
> overcomplication that isn't directly entangled with this patch purposes.
> Besides, you cannot expect compation / migration happening and racing against
> leak_balloon() all the time to make them signal events to the later, so we might
> just be creating a wait-forever condition for leak_balloon(), IMHO.

So use wait_event() or similar, and check for the existence of isolated pages.

> Cheers!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 10:01                         ` Michael S. Tsirkin
@ 2012-08-23 12:13                           ` Rafael Aquini
  2012-08-23 12:34                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-23 12:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 01:01:07PM +0300, Michael S. Tsirkin wrote:
> > So, when remove_common() calls leak_balloon() looping on
> > vb->num_pages, that won't become a tight loop. 
> > The scheme was apparently working before this series, and it will remain working
> > after it.
> 
> It seems that before we would always leak all requested memory
> in one go. I can't tell why we have a while loop there at all.
> Rusty, could you clarify please?
>

It seems that your claim isn't right. leak_balloon() cannot do it all at once,
as for each round it only releases 256 pages, at most; and the 'one go' would
require a couple of loop rounds at remove_common().
So, nothing has changed here.

 
> > Just as before, same thing here. If you leaked less than required, balloon()
> > will keep calling leak_balloon() until the balloon target is reached. This
> > scheme was working before, and it will keep working after this patch.
> >
> 
> IIUC we never hit this path before.
>  
So, how does balloon() work then?


> > > How about we signal config_change
> > > event when pages are back to pages_list?
> > 
> > I really don't know what to tell you here, but, to me, it seems like an
> > overcomplication that isn't directly entangled with this patch purposes.
> > Besides, you cannot expect compation / migration happening and racing against
> > leak_balloon() all the time to make them signal events to the later, so we might
> > just be creating a wait-forever condition for leak_balloon(), IMHO.
> 
> So use wait_event or similar, check for existance of isolated pages.
> 

The thing here is that expecting compaction, as an external event, to signal
actions to the balloon driver won't work as you desire. Also, as far as the
balloon driver is concerned, it's only a matter of time before a total, or
partial, balloon leak is accomplished, even when we have some pages isolated from
the balloon's page list.

IMHO, you're attempting to complicate a simple thing that is already working
well. As said before, there is no guarantee you'll have isolated pages
by the time you're leaking the balloon, so you might leave it waiting forever
on something that will not happen. And if there are isolated pages while the balloon
is leaking, they'll have their chance to get back to the list before the device
finishes its leaking job.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 12:13                           ` Rafael Aquini
@ 2012-08-23 12:34                             ` Michael S. Tsirkin
  2012-08-23 13:06                               ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-23 12:34 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 09:13:39AM -0300, Rafael Aquini wrote:
> On Thu, Aug 23, 2012 at 01:01:07PM +0300, Michael S. Tsirkin wrote:
> > > So, when remove_common() calls leak_balloon() looping on
> > > vb->num_pages, that won't become a tight loop. 
> > > The scheme was apparently working before this series, and it will remain working
> > > after it.
> > 
> > It seems that before we would always leak all requested memory
> > in one go. I can't tell why we have a while loop there at all.
> > Rusty, could you clarify please?
> >
> 
> It seems that your claim isn't right. leak_balloon() cannot do it all at once,
> as for each round it only releases 256 pages, at most; and the 'one go' would
> require a couple of loop rounds at remove_common().

You are right in this respect.

> So, nothing has changed here.

Yes, your patch does change things:
leak_balloon() now might return without freeing any pages.
In that case we will not be making any progress, and will just
spin, pinning the CPU.

>  
> > > Just as before, same thing here. If you leaked less than required, balloon()
> > > will keep calling leak_balloon() until the balloon target is reached. This
> > > scheme was working before, and it will keep working after this patch.
> > >
> > 
> > IIUC we never hit this path before.
> >  
> So, how does balloon() works then?
> 

It gets a request to leak a given number of pages
and executes it, then tells host that it is done.
It never needs to spin busy-waiting on a CPU for this.

> > > > How about we signal config_change
> > > > event when pages are back to pages_list?
> > > 
> > > I really don't know what to tell you here, but, to me, it seems like an
> > > overcomplication that isn't directly entangled with this patch purposes.
> > > Besides, you cannot expect compation / migration happening and racing against
> > > leak_balloon() all the time to make them signal events to the later, so we might
> > > just be creating a wait-forever condition for leak_balloon(), IMHO.
> > 
> > So use wait_event or similar, check for existance of isolated pages.
> > 
> 
> The thing here is expecting compaction as being an external event to signal
> actions to the balloon driver won't work as you desire. Also, as far as the
> balloon driver is concerned, it's only a matter of time to accomplish a total,
> or partial, balloon leak, even when we have some pages isolated from balloon's
> page list.
> 
> IMHO, you're attempting to complicate a simple thing that is already working
> well. As said before, there are no guarantees you'll have isolated pages 
> by the time you're leaking the balloon, so you might leave it waiting forever
> on something that will not happen. And if there are isolated pages while balloon
> is leaking, they'll have their chance to get back to the list before the device
> finishes its leaking job.

Well, busy-waiting pinning a CPU is ugly.  Instead we should block the thread and
wake it up when done.  I don't mind how we fix it specifically.
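
For example, something like this (rough sketch only; vb->isolated_pages and
vb->isolated_waitq are made-up names, not fields from the patch):

	/* putback path, right after the page is re-inserted into vb->pages */
	if (atomic_dec_and_test(&vb->isolated_pages))
		wake_up(&vb->isolated_waitq);

	/* remove/leak path, instead of re-looping over leak_balloon() */
	wait_event(vb->isolated_waitq,
		   atomic_read(&vb->isolated_pages) == 0);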

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 12:34                             ` Michael S. Tsirkin
@ 2012-08-23 13:06                               ` Rafael Aquini
  2012-08-23 13:53                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-23 13:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 03:34:32PM +0300, Michael S. Tsirkin wrote:
> > So, nothing has changed here.
> 
> Yes, your patch does change things:
> leak_balloon now might return without freeing any pages.
> In that case we will not be making any progress, and just
> spin, pinning CPU.

That's a transitory condition that might happen if leak_balloon() takes place
while compaction or migration is under way, and it only affects the
module unload case. Also, it won't pin the CPU, because it keeps releasing the locks
it grabs as it loops. So, we are lucubrating about rarities, IMHO.

> 
> >  
> > > > Just as before, same thing here. If you leaked less than required, balloon()
> > > > will keep calling leak_balloon() until the balloon target is reached. This
> > > > scheme was working before, and it will keep working after this patch.
> > > >
> > > 
> > > IIUC we never hit this path before.
> > >  
> > So, how does balloon() works then?
> > 
> 
> It gets a request to leak a given number of pages
> and executes it, then tells host that it is done.
> It never needs to spin busy-waiting on a CPU for this.
>

So, what does this patch change for the ordinary leak_balloon() case?

 
> Well busy wait pinning CPU is ugly.  Instead we should block thread and
> wake it up when done.  I don't mind how we fix it specifically.
>

I already told you that we do not do that by any means introduced by this patch.
You're just being stubborn here. If those bits are broken, they were already
broken before I came up with this proposal.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 13:06                               ` Rafael Aquini
@ 2012-08-23 13:53                                 ` Michael S. Tsirkin
  2012-08-23 15:21                                   ` Rafael Aquini
  2012-08-23 16:25                                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-23 13:53 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 10:06:07AM -0300, Rafael Aquini wrote:
> On Thu, Aug 23, 2012 at 03:34:32PM +0300, Michael S. Tsirkin wrote:
> > > So, nothing has changed here.
> > 
> > Yes, your patch does change things:
> > leak_balloon now might return without freeing any pages.
> > In that case we will not be making any progress, and just
> > spin, pinning CPU.
> 
> That's a transitory condition that might happen if leak_balloon() takes place
> when compaction or migration are under way, and it might only affect the
> module unload case.

Regular operation seems even more broken: host might ask
you to leak memory but because it is under compaction
you might leak nothing. No?

> Also it won't pin CPU because it keeps releasing the locks
> it grabs, as it loops.

What does releasing locks have to do with it?

> So, we are speculating about rarities, IMHO.
> > 
> > >  
> > > > > Just as before, same thing here. If you leaked less than required, balloon()
> > > > > will keep calling leak_balloon() until the balloon target is reached. This
> > > > > scheme was working before, and it will keep working after this patch.
> > > > >
> > > > 
> > > > IIUC we never hit this path before.
> > > >  
> > > So, how does balloon() work then?
> > > 
> > 
> > It gets a request to leak a given number of pages
> > and executes it, then tells host that it is done.
> > It never needs to spin busy-waiting on a CPU for this.
> >
> 
> So, what does this patch change for the ordinary leak_balloon() case?
> 

That not all pages used by balloon are on &vb->pages list.

> > Well busy wait pinning CPU is ugly.  Instead we should block thread and
> > wake it up when done.  I don't mind how we fix it specifically.
> >
> 
> I already told you that we do not do that by any means introduced by this patch.
> You're just being stubborn here. If those bits are broken, they were already
> broken before I came up with this proposal.

Sorry you don't address the points I am making.  Maybe there are no
bugs. But it looks like there are.  And assuming I am just seeing things
this just means patch needs more comments, in commit log and in
code to explain the design so that it stops looking like that.

Basically it was very simple: we assumed page->lru was never
touched for an allocated page, so it's safe to use it for
internal book-keeping by the driver.

Now, this is not the case anymore, you add some logic in mm/ that might
or might not touch page->lru depending on things like reference count.
And you are asking why things break even though you change very little
in balloon itself?  Because the interface between balloon and mm is now
big, fragile and largely undocumented.

Another strangeness I just noticed: if we ever do an extra get_page in
balloon, compaction logic in mm will break, yes?  But one expects to be
able to do get_page after alloc_page without ill effects
as long as one does put_page before free.

Just a thought: maybe it is cleaner to move all balloon page tracking
into mm/?  Implement alloc_balloon/free_balloon with methods to fill and
leak pages, and callbacks to invoke when done.  This should be good for
other hypervisors too. If you like this idea, I can even try to help out
by refactoring current code in this way, so that you can build on it.
But this is just a thought, not a must.
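
One possible shape for such an interface, with all names invented here
purely for illustration (this is not what the posted series implements):

	struct balloon_ops {
		/* hand one freshly allocated page to the hypervisor driver */
		int (*inflate)(void *priv, struct page *page);
		/* take one page back from the hypervisor driver */
		struct page *(*deflate)(void *priv);
		/* notify the driver once a fill/leak batch has been told to the host */
		void (*done)(void *priv);
	};

	struct balloon *balloon_register(const struct balloon_ops *ops, void *priv);
	void balloon_unregister(struct balloon *b);
	int balloon_fill(struct balloon *b, size_t nr_pages);
	int balloon_leak(struct balloon *b, size_t nr_pages);

With the page lists owned by mm/, isolation and migration could then stay
entirely behind this interface, and the individual hypervisor drivers would
never have to touch page->lru at all.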

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 13:53                                 ` Michael S. Tsirkin
@ 2012-08-23 15:21                                   ` Rafael Aquini
  2012-08-23 15:54                                     ` Michael S. Tsirkin
  2012-08-23 16:25                                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-23 15:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 04:53:29PM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 23, 2012 at 10:06:07AM -0300, Rafael Aquini wrote:
> > On Thu, Aug 23, 2012 at 03:34:32PM +0300, Michael S. Tsirkin wrote:
> > > > So, nothing has changed here.
> > > 
> > > Yes, your patch does change things:
> > > leak_balloon now might return without freeing any pages.
> > > In that case we will not be making any progress, and just
> > > spin, pinning CPU.
> > 
> > That's a transitory condition that might happen if leak_balloon() takes place
> > when compaction or migration are under way, and it might only affect the
> > module unload case.
> 
> Regular operation seems even more broken: host might ask
> you to leak memory but because it is under compaction
> you might leak nothing. No?
>

And that is exactly what it wants to do. If there is (temporarily) nothing to leak,
then not leaking is the only sane thing to do. Having balloon pages being migrated
does not break the leak at all, though it can last a little longer.

 
 
> > 
> > I already told you that we do not do that by any means introduced by this patch.
> > You're just being stubborn here. If those bits are broken, they were already
> > broken before I came up with this proposal.
> 
> Sorry you don't address the points I am making.  Maybe there are no
> bugs. But it looks like there are.  And assuming I am just seeing things
> this just means patch needs more comments, in commit log and in
> code to explain the design so that it stops looking like that.
>

Yes, I believe you're biased here.


> Basically it was very simple: we assumed page->lru was never
> touched for an allocated page, so it's safe to use it for
> internal book-keeping by the driver.
>
> Now, this is not the case anymore, you add some logic in mm/ that might
> or might not touch page->lru depending on things like reference count.
> And you are asking why things break even though you change very little
> in balloon itself?  Because the interface between balloon and mm is now
> big, fragile and largely undocumented.
> 

The driver doesn't use page->lru for its bookkeeping at all; it uses
vb->num_pages instead.


> Another strangeness I just noticed: if we ever do an extra get_page in
> balloon, compaction logic in mm will break, yes?  But one expects to be
> able to do get_page after alloc_page without ill effects
> as long as one does put_page before free.
>

You can do it (bump up the balloon page refcount), and it will only prevent
balloon pages from being isolated and migrated, thus reducing the effectiveness of
defragmenting memory when balloon pages are present, just like it happens today.

It really doesn't seem to be the case, though, that the virtio_balloon driver,
or any other driver which allocates pages directly from buddy, keeps raising
the page refcount.
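
Roughly the kind of check the isolation path can make so that such an extra
reference simply disables migration for that page; the baseline constant and
the helper name below are made up for this sketch, not taken from the series:

	/* assumed baseline: a page sitting in the balloon holds one reference */
	#define BALLOON_PAGE_REFCOUNT	1

	static bool balloon_page_isolatable(struct page *page)
	{
		/*
		 * Any extra get_page() done elsewhere raises the count above
		 * the baseline, so compaction simply leaves the page alone.
		 */
		return page_count(page) == BALLOON_PAGE_REFCOUNT;
	}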
 

> Just a thought: maybe it is cleaner to move all balloon page tracking
> into mm/?  Implement alloc_balloon/free_balloon with methods to fill and
> leak pages, and callbacks to invoke when done.  This should be good for
> other hypervisors too. If you like this idea, I can even try to help out
> by refactoring current code in this way, so that you can build on it.
> But this is just a thought, not a must.
>

That seems to be a good thought to be on a future enhancements wish-list, for sure.
We can start thinking of it, and I surely would be more than happy to be doing
it along with you. But I don't think not having it right away is a dealbreaker
for this proposal, as is.

I'm not against your thoughts, and I'm really glad that you're providing such
good discussion over this subject, but now I'll wait for Rusty's thoughts on
this one question.

Cheers!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 15:21                                   ` Rafael Aquini
@ 2012-08-23 15:54                                     ` Michael S. Tsirkin
  2012-08-23 16:03                                       ` Rik van Riel
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-23 15:54 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 12:21:29PM -0300, Rafael Aquini wrote:
> On Thu, Aug 23, 2012 at 04:53:29PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Aug 23, 2012 at 10:06:07AM -0300, Rafael Aquini wrote:
> > > On Thu, Aug 23, 2012 at 03:34:32PM +0300, Michael S. Tsirkin wrote:
> > > > > So, nothing has changed here.
> > > > 
> > > > Yes, your patch does change things:
> > > > leak_balloon now might return without freeing any pages.
> > > > In that case we will not be making any progress, and just
> > > > spin, pinning CPU.
> > > 
> > > That's a transitory condition that might happen if leak_balloon() takes place
> > > when compaction or migration are under way, and it might only affect the
> > > module unload case.
> > 
> > Regular operation seems even more broken: host might ask
> > you to leak memory but because it is under compaction
> > you might leak nothing. No?
> >
> 
> > And that is exactly what it wants to do. If there is (temporarily) nothing to leak,
> then not leaking is the only sane thing to do.

It's an internal issue between balloon and mm. User does not care.

> Having balloon pages being migrated
> does not break the leak at all, though it can last a little longer.
> 

Not "longer" - apparently forever unless user resend the leak command.
It's wrong - it should
1. not tell host if nothing was done
2. after migration finished leak and tell host

> > > 
> > > I already told you that we do not do that by any means introduced by this patch.
> > > You're just being stubborn here. If those bits are broken, they were already
> > > broken before I came up with this proposal.
> > 
> > Sorry you don't address the points I am making.  Maybe there are no
> > bugs. But it looks like there are.  And assuming I am just seeing things
> > this just means patch needs more comments, in commit log and in
> > code to explain the design so that it stops looking like that.
> >
> 
> Yes, I believe you're biased here.
> 
> 
> > Basically it was very simple: we assumed page->lru was never
> > touched for an allocated page, so it's safe to use it for
> > internal book-keeping by the driver.
> >
> > Now, this is not the case anymore, you add some logic in mm/ that might
> > or might not touch page->lru depending on things like reference count.
> > And you are asking why things break even though you change very little
> > in balloon itself?  Because the interface between balloon and mm is now
> > big, fragile and largely undocumented.
> > 
> 
> The driver doesn't use page->lru for its bookkeeping at all; it uses
> vb->num_pages instead. 

$ grep lru drivers/virtio/virtio_balloon.c
                list_add(&page->lru, &vb->pages);
                page = list_first_entry(&vb->pages, struct page, lru);
                list_del(&page->lru);


> 
> > Another strangeness I just noticed: if we ever do an extra get_page in
> > balloon, compaction logic in mm will break, yes?  But one expects to be
> > able to do get_page after alloc_page without ill effects
> > as long as one does put_page before free.
> >
> 
> You can do it (bump up the balloon page refcount), and it will only prevent
> balloon pages from being isolated and migrated, thus reducing the effectiveness of
> defragmenting memory when balloon pages are present, just like it happens today.
> 
> It really doesn't seem to be the case, though, that the virtio_balloon driver,
> or any other driver which allocates pages directly from buddy, keeps raising
> the page refcount.
>  

E.g. network devices routinely play with pages they get from buddy,
this is used for sharing memory between skbs.
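
For instance, a driver handing the same page to the stack as an skb fragment
takes an extra reference on it (real helpers; this is only an illustrative
fragment, and skb, page, offset and len are assumed to exist in the
surrounding driver code):

	get_page(page);		/* extra reference for the skb's use of the page */
	skb_fill_page_desc(skb, 0, page, offset, len);
	/* that reference is dropped via put_page() when the skb is freed */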

> > Just a thought: maybe it is cleaner to move all balloon page tracking
> > into mm/?  Implement alloc_balloon/free_balloon with methods to fill and
> > leak pages, and callbacks to invoke when done.  This should be good for
> > other hypervisors too. If you like this idea, I can even try to help out
> > by refactoring current code in this way, so that you can build on it.
> > But this is just a thought, not a must.
> >
> 
> That seems to be a good thought to be on a future enhancements wish-list, for sure.
> We can start thinking of it, and I surely would be more than happy to be doing
> it along with you. But I don't think not having it right away is a dealbreaker
> for this proposal, as is.

I grant busywait on module unloading isn't a huge deal breaker.

Poking in mm internals is not a dealbreaker?
Not leaking as much as
you are asked to isn't?

> I'm not against your thoughts, and I'm really glad that you're providing such
> good discussion over this subject, but now I'll wait for Rusty's thoughts on
> this one question.
> 
> Cheers!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 15:54                                     ` Michael S. Tsirkin
@ 2012-08-23 16:03                                       ` Rik van Riel
  2012-08-23 16:06                                         ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2012-08-23 16:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rafael Aquini, Paul E. McKenney, Peter Zijlstra, linux-mm,
	linux-kernel, virtualization, Rusty Russell, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On 08/23/2012 11:54 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 23, 2012 at 12:21:29PM -0300, Rafael Aquini wrote:
>> On Thu, Aug 23, 2012 at 04:53:29PM +0300, Michael S. Tsirkin wrote:
>>> On Thu, Aug 23, 2012 at 10:06:07AM -0300, Rafael Aquini wrote:
>>>> On Thu, Aug 23, 2012 at 03:34:32PM +0300, Michael S. Tsirkin wrote:
>>>>>> So, nothing has changed here.
>>>>>
>>>>> Yes, your patch does change things:
>>>>> leak_balloon now might return without freeing any pages.
>>>>> In that case we will not be making any progress, and just
>>>>> spin, pinning CPU.
>>>>
>>>> That's a transitory condition that might happen if leak_balloon() takes place
>>>> when compaction or migration are under way, and it might only affect the
>>>> module unload case.
>>>
>>> Regular operation seems even more broken: host might ask
>>> you to leak memory but because it is under compaction
>>> you might leak nothing. No?
>>>
>>
>> And that is exactly what it wants to do. If there is (temporarily) nothing to leak,
>> then not leaking is the only sane thing to do.
>
> It's an internal issue between balloon and mm. User does not care.
>
>> Having balloon pages being migrated
>> does not break the leak at all, though it can last a little longer.
>>
>
> Not "longer" - apparently forever unless user resend the leak command.
> It's wrong - it should
> 1. not tell host if nothing was done
> 2. after migration finished leak and tell host

Agreed.  If the balloon is told to leak N pages, and could
not do so because those pages were locked, the balloon driver
needs to retry (maybe waiting on a page lock?) and not signal
completion until after the job has been completed.

Having the balloon driver wait on the page lock should be
fine, because compaction does not hold the page lock for
long.
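
A sketch of that retry, assuming a hypothetical helper that picks the next
page off the balloon list; this is not the code in the posted driver:

	static struct page *leak_one_page(struct virtio_balloon *vb)
	{
		struct page *page = first_balloon_page(vb);	/* hypothetical */

		/*
		 * Sleeps only briefly if the page is being migrated, since
		 * compaction drops the page lock quickly.
		 */
		lock_page(page);
		list_del(&page->lru);
		unlock_page(page);
		return page;
	}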

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 16:03                                       ` Rik van Riel
@ 2012-08-23 16:06                                         ` Rafael Aquini
  2012-08-23 16:10                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-23 16:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michael S. Tsirkin, Paul E. McKenney, Peter Zijlstra, linux-mm,
	linux-kernel, virtualization, Rusty Russell, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 12:03:15PM -0400, Rik van Riel wrote:
> >
> >Not "longer" - apparently forever unless user resend the leak command.
> >It's wrong - it should
> >1. not tell host if nothing was done
> >2. after migration finished leak and tell host
> 
> Agreed.  If the balloon is told to leak N pages, and could
> not do so because those pages were locked, the balloon driver
> needs to retry (maybe waiting on a page lock?) and not signal
> completion until after the job has been completed.
> 
> Having the balloon driver wait on the page lock should be
> fine, because compaction does not hold the page lock for
> long.

And that is precisely what leak_balloon() is doing. When it stumbles across a
locked page, it skips that leak round to give compaction a chance to
finish its task.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 16:06                                         ` Rafael Aquini
@ 2012-08-23 16:10                                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-23 16:10 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Rik van Riel, Paul E. McKenney, Peter Zijlstra, linux-mm,
	linux-kernel, virtualization, Rusty Russell, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 01:06:48PM -0300, Rafael Aquini wrote:
> On Thu, Aug 23, 2012 at 12:03:15PM -0400, Rik van Riel wrote:
> > >
> > >Not "longer" - apparently forever unless user resend the leak command.
> > >It's wrong - it should
> > >1. not tell host if nothing was done
> > >2. after migration finished leak and tell host
> > 
> > Agreed.  If the balloon is told to leak N pages, and could
> > not do so because those pages were locked, the balloon driver
> > needs to retry (maybe waiting on a page lock?) and not signal
> > completion until after the job has been completed.
> > 
> > Having the balloon driver wait on the page lock should be
> > fine, because compaction does not hold the page lock for
> > long.
> 
> And that is precisely what leak_balloon() is doing. When it stumbles across a
> locked page, it skips that leak round to give compaction a chance to
> finish its task.
> 

Yes but I do not see where it will retry.
If it does, please add a comment pointing out where it happens.

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 13:53                                 ` Michael S. Tsirkin
  2012-08-23 15:21                                   ` Rafael Aquini
@ 2012-08-23 16:25                                   ` Michael S. Tsirkin
  2012-08-23 17:28                                     ` Rafael Aquini
  1 sibling, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-23 16:25 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 04:53:28PM +0300, Michael S. Tsirkin wrote:
> Basically it was very simple: we assumed page->lru was never
> touched for an allocated page, so it's safe to use it for
> internal book-keeping by the driver.
> 
> Now, this is not the case anymore, you add some logic in mm/ that might
> or might not touch page->lru depending on things like reference count.

Another thought: would the issue go away if balloon used
page->private to link pages instead of LRU?
mm core could keep a reference on page to avoid it
being used while mm handles it (maybe it does already?).

If we do this, will not the only change to balloon be to tell mm that it
can use compaction for these pages when it allocates the page: using
some GFP flag or a new API?
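
Two possible shapes for that allocation-time opt-in; both the
__GFP_BALLOON_MOVABLE flag and alloc_balloon_page() are invented for this
sketch and exist neither in the kernel nor in this series:

	struct page *page;

	/* (a) a GFP flag telling mm the page may be grabbed by compaction */
	page = alloc_page(GFP_HIGHUSER | __GFP_NOWARN | __GFP_BALLOON_MOVABLE);

	/* (b) a dedicated allocator that also registers the page with mm */
	page = alloc_balloon_page(vb);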

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 16:25                                   ` Michael S. Tsirkin
@ 2012-08-23 17:28                                     ` Rafael Aquini
  2012-08-23 17:59                                       ` Rik van Riel
  2012-08-23 23:36                                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-23 17:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 07:25:05PM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 23, 2012 at 04:53:28PM +0300, Michael S. Tsirkin wrote:
> > Basically it was very simple: we assumed page->lru was never
> > touched for an allocated page, so it's safe to use it for
> > internal book-keeping by the driver.
> > 
> > Now, this is not the case anymore, you add some logic in mm/ that might
> > or might not touch page->lru depending on things like reference count.
> 
> Another thought: would the issue go away if balloon used
> page->private to link pages instead of LRU?
> mm core could keep a reference on page to avoid it
> being used while mm handles it (maybe it does already?).
>
I don't think so. That would be a lot trickier and more complex, IMHO.
 
> If we do this, will not the only change to balloon be to tell mm that it
> can use compaction for these pages when it allocates the page: using
> some GFP flag or a new API?
> 

What about keeping a counter in the virtio_balloon structure of how many pages are
isolated from the balloon's list and checking it at leak time?
If the counter gets > 0 then we can safely put leak_balloon() to wait until
the balloon page list gets completely refilled. I guess that is simple to get
accomplished and potentially addresses all your concerns on this issue.

Cheers!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 17:28                                     ` Rafael Aquini
@ 2012-08-23 17:59                                       ` Rik van Riel
  2012-08-23 23:36                                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2012-08-23 17:59 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Michael S. Tsirkin, Paul E. McKenney, Peter Zijlstra, linux-mm,
	linux-kernel, virtualization, Rusty Russell, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On 08/23/2012 01:28 PM, Rafael Aquini wrote:

> What about keeping a counter in the virtio_balloon structure of how many pages are
> isolated from the balloon's list and checking it at leak time?
> If the counter gets > 0 then we can safely put leak_balloon() to wait until
> the balloon page list gets completely refilled.

We only have to wait if we failed to leak enough
pages, and then only for as many additional pages
as we require.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 17:28                                     ` Rafael Aquini
  2012-08-23 17:59                                       ` Rik van Riel
@ 2012-08-23 23:36                                       ` Michael S. Tsirkin
  2012-08-24  0:26                                         ` Rafael Aquini
                                                           ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-23 23:36 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 02:28:45PM -0300, Rafael Aquini wrote:
> On Thu, Aug 23, 2012 at 07:25:05PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Aug 23, 2012 at 04:53:28PM +0300, Michael S. Tsirkin wrote:
> > > Basically it was very simple: we assumed page->lru was never
> > > touched for an allocated page, so it's safe to use it for
> > > internal book-keeping by the driver.
> > > 
> > > Now, this is not the case anymore, you add some logic in mm/ that might
> > > or might not touch page->lru depending on things like reference count.
> > 
> > Another thought: would the issue go away if balloon used
> > page->private to link pages instead of LRU?
> > mm core could keep a reference on page to avoid it
> > being used while mm handles it (maybe it does already?).
> >
> I don't think so. That would be a lot trickier and more complex, IMHO.

What's tricky? Linking pages through a void * private pointer?
I can code it up in a couple of minutes.
It's middle of the night so too tired to test but still:

> > If we do this, will not the only change to balloon be to tell mm that it
> > can use compaction for these pages when it allocates the page: using
> > some GFP flag or a new API?
> > 
> 
> What about keeping a counter in the virtio_balloon structure of how many pages are
> isolated from the balloon's list and checking it at leak time?
> If the counter gets > 0 then we can safely put leak_balloon() to wait until
> the balloon page list gets completely refilled. I guess that is simple to get
> accomplished and potentially addresses all your concerns on this issue.
> 
> Cheers!

I would wake it each time after adding a page, then it
can stop waiting when it leaks enough.
But again, it's cleaner to just keep tracking all
pages, let mm hang on to them by keeping a reference.

--->

virtio-balloon: replace page->lru list with page->private.

The point is to free up page->lru for use by compaction.
Warning: completely untested, will provide tested version
if we agree on this direction.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0908e60..b38f57ce 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@ struct virtio_balloon
 	 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
 	 * to num_pages above.
 	 */
-	struct list_head pages;
+	void *pages;
 
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
@@ -141,7 +141,9 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 		set_page_pfns(vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		totalram_pages--;
-		list_add(&page->lru, &vb->pages);
+		/* Add to list of pages */
+		page->private = (unsigned long)vb->pages;
+		vb->pages = page;
 	}
 
 	/* Didn't get any?  Oh well. */
@@ -171,8 +173,9 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
 
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = list_first_entry(&vb->pages, struct page, lru);
-		list_del(&page->lru);
+		/* Delete from list of pages */
+		page = vb->pages;
+		vb->pages = (void *)page->private;
 		set_page_pfns(vb->pfns + vb->num_pfns, page);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -350,7 +353,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		goto out;
 	}
 
-	INIT_LIST_HEAD(&vb->pages);
+	vb->pages = NULL;
 	vb->num_pages = 0;
 	init_waitqueue_head(&vb->config_change);
 	init_waitqueue_head(&vb->acked);
-- 
MST

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 23:36                                       ` Michael S. Tsirkin
@ 2012-08-24  0:26                                         ` Rafael Aquini
  2012-08-24  0:33                                         ` Rafael Aquini
  2012-08-24  3:12                                         ` Rik van Riel
  2 siblings, 0 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-24  0:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Fri, Aug 24, 2012 at 02:36:16AM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 23, 2012 at 02:28:45PM -0300, Rafael Aquini wrote:
> > On Thu, Aug 23, 2012 at 07:25:05PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Aug 23, 2012 at 04:53:28PM +0300, Michael S. Tsirkin wrote:
> > > > Basically it was very simple: we assumed page->lru was never
> > > > touched for an allocated page, so it's safe to use it for
> > > > internal book-keeping by the driver.
> > > > 
> > > > Now, this is not the case anymore, you add some logic in mm/ that might
> > > > or might not touch page->lru depending on things like reference count.
> > > 
> > > Another thought: would the issue go away if balloon used
> > > page->private to link pages instead of LRU?
> > > mm core could keep a reference on page to avoid it
> > > being used while mm handles it (maybe it does already?).
> > >
> > I don't think so. That would be a lot trickier and more complex, IMHO.
> 
> What's tricky? Linking pages through a void * private pointer?
> I can code it up in a couple of minutes.
> It's middle of the night so too tired to test but still:
> 
> > > If we do this, will not the only change to balloon be to tell mm that it
> > > can use compaction for these pages when it allocates the page: using
> > > some GFP flag or a new API?
> > > 
> > 
> > What about keeping a counter in the virtio_balloon structure of how many pages are
> > isolated from the balloon's list and checking it at leak time?
> > If the counter gets > 0 then we can safely put leak_balloon() to wait until
> > the balloon page list gets completely refilled. I guess that is simple to get
> > accomplished and potentially addresses all your concerns on this issue.
> > 
> > Cheers!
> 
> I would wake it each time after adding a page, then it
> can stop waiting when it leaks enough.
> But again, it's cleaner to just keep tracking all
> pages, let mm hang on to them by keeping a reference.
> 
> --->
> 
> virtio-balloon: replace page->lru list with page->private.
> 
> The point is to free up page->lru for use by compaction.
> Warning: completely untested, will provide tested version
> if we agree on this direction.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>

This way the balloon driver will potentially release pages that were already
migrated and don't belong to it anymore, since the page under migration never
gets isolated from the balloon's page list. It's a lot more dangerous than it was
before.

I'm working on getting leak_balloon() done the right way, as you have correctly
pointed out. I was blind and biased. So, thank you for pointing me the way.


> ---
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 0908e60..b38f57ce 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -56,7 +56,7 @@ struct virtio_balloon
>  	 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
>  	 * to num_pages above.
>  	 */
> -	struct list_head pages;
> +	void *pages;
>  
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
> @@ -141,7 +141,9 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
>  		set_page_pfns(vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		totalram_pages--;
> -		list_add(&page->lru, &vb->pages);
> +		/* Add to list of pages */
> +		page->private = (unsigned long)vb->pages;
> +		vb->pages = page;
>  	}
>  
>  	/* Didn't get any?  Oh well. */
> @@ -171,8 +173,9 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
>  	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> -		page = list_first_entry(&vb->pages, struct page, lru);
> -		list_del(&page->lru);
> +		/* Delete from list of pages */
> +		page = vb->pages;
> +		vb->pages = (void *)page->private;
>  		set_page_pfns(vb->pfns + vb->num_pfns, page);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -350,7 +353,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		goto out;
>  	}
>  
> -	INIT_LIST_HEAD(&vb->pages);
> +	vb->pages = NULL;
>  	vb->num_pages = 0;
>  	init_waitqueue_head(&vb->config_change);
>  	init_waitqueue_head(&vb->acked);
> -- 
> MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 23:36                                       ` Michael S. Tsirkin
  2012-08-24  0:26                                         ` Rafael Aquini
@ 2012-08-24  0:33                                         ` Rafael Aquini
  2012-08-24  0:38                                           ` Rafael Aquini
  2012-08-24  3:12                                         ` Rik van Riel
  2 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-24  0:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Fri, Aug 24, 2012 at 02:36:16AM +0300, Michael S. Tsirkin wrote:
> I would wake it each time after adding a page, then it
> can stop waiting when it leaks enough.
> But again, it's cleaner to just keep tracking all
> pages, let mm hang on to them by keeping a reference.
> 
Here is a rough idea of how it's coming along:

Basically, I'm introducing an atomic counter to track isolated pages, and I
have also changed vb->num_pages into an atomic counter. All inc/dec operations
take place under the pages_lock spinlock, and we only perform work under the page lock.

It's still missing the wait-part (I'll write it during the weekend) and your
concerns (and mine) will be addressed, IMHO.

---8<---
+/*
+ *
+ */
+static inline void __wait_on_isolated_pages(struct virtio_balloon *vb,
+                                           size_t num)
+{
+       /* There are no isolated pages for this balloon device */
+       if (!atomic_read(&vb->num_isolated_pages))
+               return;
+
+       /* the leak target is smaller than # of pages on vb->pages list */
+       if (num < (atomic_read(&vb->num_pages) -
+           atomic_read(&vb->num_isolated_pages)))
+               return;
+       else {
+               spin_unlock(&vb->pages_lock);
+               /* wait stuff goes here */
+               spin_lock(&vb->pages_lock);
+       }
+}
+
 static void leak_balloon(struct virtio_balloon *vb, size_t num)
 {
-       struct page *page;
+       /* The array of pfns we tell the Host about. */
+       unsigned int num_pfns;
+       u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];

        /* We can only do one array worth at a time. */
-       num = min(num, ARRAY_SIZE(vb->pfns));
+       num = min(num, ARRAY_SIZE(pfns));

-       for (vb->num_pfns = 0; vb->num_pfns < num;
-            vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-               page = list_first_entry(&vb->pages, struct page, lru);
-               list_del(&page->lru);
-               set_page_pfns(vb->pfns + vb->num_pfns, page);
-               vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+       for (num_pfns = 0; num_pfns < num;
+            num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+               struct page *page = NULL;
+               spin_lock(&vb->pages_lock);
+               __wait_on_isolated_pages(vb, num);
+
+               if (!list_empty(&vb->pages))
+                       page = list_first_entry(&vb->pages, struct page, lru);
+               /*
+                * Grab the page lock to avoid racing against threads isolating
+                * pages from, or migrating pages back to vb->pages list.
+                * (both tasks are done under page lock protection)
+                *
+                * Failing to grab the page lock here means this page is being
+                * isolated already, or its migration has not finished yet.
+                */
+               if (page && trylock_page(page)) {
+                       clear_balloon_mapping(page);
+                       list_del(&page->lru);
+                       set_page_pfns(pfns + num_pfns, page);
+                       atomic_sub(VIRTIO_BALLOON_PAGES_PER_PAGE,
+                                  &vb->num_pages);
+                       unlock_page(page);
+               }
+               spin_unlock(&vb->pages_lock);
        }

        /*
@@ -182,8 +251,10 @@ static void leak_balloon(struct virtio_balloon *vb, size_t
num)
         * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
         * is true, we *have* to do it in this order
         */
+       mutex_lock(&vb->balloon_lock);
        tell_host(vb, vb->deflate_vq);
-       release_pages_by_pfn(vb->pfns, vb->num_pfns);
+       mutex_unlock(&vb->balloon_lock);
+       release_pages_by_pfn(pfns, num_pfns);
 }
---8<---

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-24  0:33                                         ` Rafael Aquini
@ 2012-08-24  0:38                                           ` Rafael Aquini
  2012-08-24  0:49                                             ` Rafael Aquini
  0 siblings, 1 reply; 52+ messages in thread
From: Rafael Aquini @ 2012-08-24  0:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 09:33:53PM -0300, Rafael Aquini wrote:
> On Fri, Aug 24, 2012 at 02:36:16AM +0300, Michael S. Tsirkin wrote:
> > I would wake it each time after adding a page, then it
> > can stop waiting when it leaks enough.
> > But again, it's cleaner to just keep tracking all
> > pages, let mm hang on to them by keeping a reference.
> > 

Btw, it's also late here, and there is still some work to be done around those
bits, but I guess this has the potential to get this issue nailed.

> Here is a rough idea of how it's coming along:
> 
> Basically, I'm introducing an atomic counter to track isolated pages, and I
> have also changed vb->num_pages into an atomic counter. All inc/dec operations
> take place under the pages_lock spinlock, and we only perform work under the page lock.
> 
> It's still missing the wait-part (I'll write it during the weekend) and your
> concerns (and mine) will be addressed, IMHO.
> 
> ---8<---
> +/*
> + *
> + */
> +static inline void __wait_on_isolated_pages(struct virtio_balloon *vb,
> +                                           size_t num)
> +{
> +       /* There are no isolated pages for this balloon device */
> +       if (!atomic_read(&vb->num_isolated_pages))
> +               return;
> +
> +       /* the leak target is smaller than # of pages on vb->pages list */
> +       if (num < (atomic_read(&vb->num_pages) -
> +           atomic_read(&vb->num_isolated_pages)))
> +               return;
> +       else {
> +               spin_unlock(&vb->pages_lock);
> +               /* wait stuff goes here */
> +               spin_lock(&vb->pages_lock);
> +       }
> +}
> +
>  static void leak_balloon(struct virtio_balloon *vb, size_t num)
>  {
> -       struct page *page;
> +       /* The array of pfns we tell the Host about. */
> +       unsigned int num_pfns;
> +       u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> 
>         /* We can only do one array worth at a time. */
> -       num = min(num, ARRAY_SIZE(vb->pfns));
> +       num = min(num, ARRAY_SIZE(pfns));
> 
> -       for (vb->num_pfns = 0; vb->num_pfns < num;
> -            vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> -               page = list_first_entry(&vb->pages, struct page, lru);
> -               list_del(&page->lru);
> -               set_page_pfns(vb->pfns + vb->num_pfns, page);
> -               vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> +       for (num_pfns = 0; num_pfns < num;
> +            num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> +               struct page *page = NULL;
> +               spin_lock(&vb->pages_lock);
> +               __wait_on_isolated_pages(vb, num);
> +
> +               if (!list_empty(&vb->pages))
> +                       page = list_first_entry(&vb->pages, struct page, lru);
> +               /*
> +                * Grab the page lock to avoid racing against threads isolating
> +                * pages from, or migrating pages back to vb->pages list.
> +                * (both tasks are done under page lock protection)
> +                *
> +                * Failing to grab the page lock here means this page is being
> +                * isolated already, or its migration has not finished yet.
> +                */
> +               if (page && trylock_page(page)) {
> +                       clear_balloon_mapping(page);
> +                       list_del(&page->lru);
> +                       set_page_pfns(pfns + num_pfns, page);
> +                       atomic_sub(VIRTIO_BALLOON_PAGES_PER_PAGE,
> +                                  &vb->num_pages);
> +                       unlock_page(page);
> +               }
> +               spin_unlock(&vb->pages_lock);
>         }
> 
>         /*
> @@ -182,8 +251,10 @@ static void leak_balloon(struct virtio_balloon *vb, size_t
> num)
>          * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>          * is true, we *have* to do it in this order
>          */
> +       mutex_lock(&vb->balloon_lock);
>         tell_host(vb, vb->deflate_vq);
> -       release_pages_by_pfn(vb->pfns, vb->num_pfns);
> +       mutex_unlock(&vb->balloon_lock);
> +       release_pages_by_pfn(pfns, num_pfns);
>  }
> ---8<---

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-24  0:38                                           ` Rafael Aquini
@ 2012-08-24  0:49                                             ` Rafael Aquini
  0 siblings, 0 replies; 52+ messages in thread
From: Rafael Aquini @ 2012-08-24  0:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paul E. McKenney, Peter Zijlstra, linux-mm, linux-kernel,
	virtualization, Rusty Russell, Rik van Riel, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 09:38:48PM -0300, Rafael Aquini wrote:
> On Thu, Aug 23, 2012 at 09:33:53PM -0300, Rafael Aquini wrote:
> > On Fri, Aug 24, 2012 at 02:36:16AM +0300, Michael S. Tsirkin wrote:
> > > I would wake it each time after adding a page, then it
> > > can stop waiting when it leaks enough.
> > > But again, it's cleaner to just keep tracking all
> > > pages, let mm hang on to them by keeping a reference.
> > > 
> 
> Btw, it's also late here, and there is still some work to be done around those
> bits, but I guess this has the potential to get this issue nailed.
> 
> > Here is a rough idea of how it's coming along:
> > 
> > Basically, I'm introducing an atomic counter to track isolated pages, and I
> > have also changed vb->num_pages into an atomic counter. All inc/dec operations
> > take place under the pages_lock spinlock, and we only perform work under the page lock.
> > 
> > It's still missing the wait-part (I'll write it during the weekend) and your
> > concerns (and mine) will be addressed, IMHO.
> > 
> > ---8<---
> > +/*
> > + *
> > + */
> > +static inline void __wait_on_isolated_pages(struct virtio_balloon *vb,
> > +                                           size_t num)
> > +{
> > +       /* There are no isolated pages for this balloon device */
> > +       if (!atomic_read(&vb->num_isolated_pages))
> > +               return;
> > +
> > +       /* the leak target is smaller than # of pages on vb->pages list */
> > +       if (num < (atomic_read(&vb->num_pages) -
> > +           atomic_read(&vb->num_isolated_pages)))
> > +               return;
> > +       else {
> > +               spin_unlock(&vb->pages_lock);
> > +               /* wait stuff goes here */
> > +               spin_lock(&vb->pages_lock);
> > +       }
> > +}
> > +
> >  static void leak_balloon(struct virtio_balloon *vb, size_t num)
> >  {
> > -       struct page *page;
> > +       /* The array of pfns we tell the Host about. */
> > +       unsigned int num_pfns;
> > +       u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > 
> >         /* We can only do one array worth at a time. */
> > -       num = min(num, ARRAY_SIZE(vb->pfns));
> > +       num = min(num, ARRAY_SIZE(pfns));
> > 
> > -       for (vb->num_pfns = 0; vb->num_pfns < num;
> > -            vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> > -               page = list_first_entry(&vb->pages, struct page, lru);
> > -               list_del(&page->lru);
> > -               set_page_pfns(vb->pfns + vb->num_pfns, page);
> > -               vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> > +       for (num_pfns = 0; num_pfns < num;
> > +            num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {

Doh, here I need a while (num_pfns < num) loop


> > +               struct page *page = NULL;
> > +               spin_lock(&vb->pages_lock);
> > +               __wait_on_isolated_pages(vb, num);
> > +
> > +               if (!list_empty(&vb->pages))
> > +                       page = list_first_entry(&vb->pages, struct page, lru);
> > +               /*
> > +                * Grab the page lock to avoid racing against threads isolating
> > +                * pages from, or migrating pages back to vb->pages list.
> > +                * (both tasks are done under page lock protection)
> > +                *
> > +                * Failing to grab the page lock here means this page is being
> > +                * isolated already, or its migration has not finished yet.
> > +                */
> > +               if (page && trylock_page(page)) {
> > +                       clear_balloon_mapping(page);
> > +                       list_del(&page->lru);
> > +                       set_page_pfns(pfns + num_pfns, page);
> > +                       atomic_sub(VIRTIO_BALLOON_PAGES_PER_PAGE,
> > +                                  &vb->num_pages);

and here: num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;		

> > +                       unlock_page(page);
> > +               }
> > +               spin_unlock(&vb->pages_lock);
> >         }
> > 
> >         /*
> > @@ -182,8 +251,10 @@ static void leak_balloon(struct virtio_balloon *vb, size_t
> > num)
> >          * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
> >          * is true, we *have* to do it in this order
> >          */
> > +       mutex_lock(&vb->balloon_lock);
> >         tell_host(vb, vb->deflate_vq);
> > -       release_pages_by_pfn(vb->pfns, vb->num_pfns);
> > +       mutex_unlock(&vb->balloon_lock);
> > +       release_pages_by_pfn(pfns, num_pfns);
> >  }
> > ---8<---

Well, I'll assume that's because I've been working for 14h already and am
running out of caffeine.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-23 23:36                                       ` Michael S. Tsirkin
  2012-08-24  0:26                                         ` Rafael Aquini
  2012-08-24  0:33                                         ` Rafael Aquini
@ 2012-08-24  3:12                                         ` Rik van Riel
  2012-08-24  8:03                                           ` Michael S. Tsirkin
  2 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2012-08-24  3:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rafael Aquini, Paul E. McKenney, Peter Zijlstra, linux-mm,
	linux-kernel, virtualization, Rusty Russell, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On 08/23/2012 07:36 PM, Michael S. Tsirkin wrote:

> --->
>
> virtio-balloon: replace page->lru list with page->private.
>
> The point is to free up page->lru for use by compaction.
> Warning: completely untested, will provide tested version
> if we agree on this direction.

A singly linked list is not going to work for page migration,
which needs to get pages that might be in the middle of the
balloon list.
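
For illustration, with the page->private chain from the draft above,
unlinking a page from the middle means walking the chain from the head to
find its predecessor first (a sketch against that draft's layout, with an
invented helper name):

	static void unlink_balloon_page(struct virtio_balloon *vb, struct page *page)
	{
		struct page *prev = NULL;
		struct page *cur = vb->pages;

		/* find the predecessor of 'page' by walking from the head */
		while (cur && cur != page) {
			prev = cur;
			cur = (struct page *)cur->private;
		}
		if (!cur)
			return;				/* page is not on the list */
		if (prev)
			prev->private = cur->private;	/* unlink from the middle */
		else
			vb->pages = (struct page *)cur->private;	/* was the head */
	}

With the existing doubly linked list the same unlink is just
list_del(&page->lru), regardless of where the page sits.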

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility
  2012-08-24  3:12                                         ` Rik van Riel
@ 2012-08-24  8:03                                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2012-08-24  8:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rafael Aquini, Paul E. McKenney, Peter Zijlstra, linux-mm,
	linux-kernel, virtualization, Rusty Russell, Mel Gorman,
	Andi Kleen, Andrew Morton, Konrad Rzeszutek Wilk, Minchan Kim

On Thu, Aug 23, 2012 at 11:12:17PM -0400, Rik van Riel wrote:
> On 08/23/2012 07:36 PM, Michael S. Tsirkin wrote:
> 
> >--->
> >
> >virtio-balloon: replace page->lru list with page->private.
> >
> >The point is to free up page->lru for use by compaction.
> >Warning: completely untested, will provide tested version
> >if we agree on this direction.
> 
> A singly linked list is not going to work for page migration,
> which needs to get pages that might be in the middle of the
> balloon list.

For virtballoon_migratepage? Hmm I think you are right. I'll
need to think it over but if we can think of no other way
to avoid the need to handle isolation in virtio,
we'll just have to use the original plan and add
balloon core to mm.

> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2012-08-24  8:03 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-21 12:47 [PATCH v8 0/5] make balloon pages movable by compaction Rafael Aquini
2012-08-21 12:47 ` [PATCH v8 1/5] mm: introduce a common interface for balloon pages mobility Rafael Aquini
2012-08-21 13:52   ` Michael S. Tsirkin
2012-08-21 14:25     ` Michael S. Tsirkin
2012-08-21 15:16     ` Peter Zijlstra
2012-08-21 15:41       ` Michael S. Tsirkin
2012-08-21 17:42         ` Rafael Aquini
2012-08-21 19:28           ` Michael S. Tsirkin
2012-08-21 17:55     ` Rafael Aquini
2012-08-21 19:16       ` Michael S. Tsirkin
2012-08-21 19:34         ` Rafael Aquini
2012-08-22  0:06           ` Michael S. Tsirkin
2012-08-21 15:20   ` Peter Zijlstra
2012-08-21 16:24     ` Paul E. McKenney
2012-08-21 17:28       ` Rafael Aquini
2012-08-21 19:13         ` Michael S. Tsirkin
2012-08-21 19:23           ` Rafael Aquini
2012-08-21 19:30             ` Michael S. Tsirkin
2012-08-21 20:45               ` Rafael Aquini
2012-08-22  0:07                 ` Michael S. Tsirkin
2012-08-22  1:19                   ` Rafael Aquini
2012-08-22  9:33                     ` Michael S. Tsirkin
2012-08-23  2:19                       ` Rafael Aquini
2012-08-23 10:01                         ` Michael S. Tsirkin
2012-08-23 12:13                           ` Rafael Aquini
2012-08-23 12:34                             ` Michael S. Tsirkin
2012-08-23 13:06                               ` Rafael Aquini
2012-08-23 13:53                                 ` Michael S. Tsirkin
2012-08-23 15:21                                   ` Rafael Aquini
2012-08-23 15:54                                     ` Michael S. Tsirkin
2012-08-23 16:03                                       ` Rik van Riel
2012-08-23 16:06                                         ` Rafael Aquini
2012-08-23 16:10                                           ` Michael S. Tsirkin
2012-08-23 16:25                                   ` Michael S. Tsirkin
2012-08-23 17:28                                     ` Rafael Aquini
2012-08-23 17:59                                       ` Rik van Riel
2012-08-23 23:36                                       ` Michael S. Tsirkin
2012-08-24  0:26                                         ` Rafael Aquini
2012-08-24  0:33                                         ` Rafael Aquini
2012-08-24  0:38                                           ` Rafael Aquini
2012-08-24  0:49                                             ` Rafael Aquini
2012-08-24  3:12                                         ` Rik van Riel
2012-08-24  8:03                                           ` Michael S. Tsirkin
2012-08-21 12:47 ` [PATCH v8 2/5] mm: introduce compaction and migration for ballooned pages Rafael Aquini
2012-08-21 12:47 ` [PATCH v8 3/5] virtio_balloon: introduce migration primitives to balloon pages Rafael Aquini
2012-08-21 14:40   ` Michael S. Tsirkin
2012-08-21 15:34     ` Peter Zijlstra
2012-08-21 15:37     ` Peter Zijlstra
2012-08-21 14:57   ` Michael S. Tsirkin
2012-08-21 12:47 ` [PATCH v8 4/5] mm: introduce putback_movable_pages() Rafael Aquini
2012-08-21 14:42   ` Michael S. Tsirkin
2012-08-21 12:47 ` [PATCH v8 5/5] mm: add vm event counters for balloon pages compaction Rafael Aquini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).