* [PATCH v12 0/8] Virtio-balloon Enhancement
@ 2017-07-12 12:40 Wei Wang
  2017-07-12 12:40 ` [PATCH v12 1/8] virtio-balloon: deflate via a page list Wei Wang
                   ` (8 more replies)
  0 siblings, 9 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following new
features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one by one; and
2) cmdq: a new virtqueue to send commands between the device and driver.
Currently, it supports commands to report memory stats (replacing the old
stats_vq mechanism) and to report guest unused pages.

Change Log:

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro
to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages blocks
directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3) virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) Re-wrote the virtio-balloon implementation patch.
3) Commit log changes.
4) Patch re-organization.

Liang Li (1):
  virtio-balloon: deflate via a page list

Matthew Wilcox (1):
  Introduce xbitmap

Wei Wang (6):
  virtio-balloon: coding format cleanup
  xbitmap: add xb_find_next_bit() and xb_zero()
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  mm: export symbol of next_zone and first_online_pgdat
  virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ

 drivers/virtio/virtio_balloon.c     | 414 ++++++++++++++++++++++++++++++++----
 drivers/virtio/virtio_ring.c        | 224 +++++++++++++++++--
 include/linux/mm.h                  |   5 +
 include/linux/radix-tree.h          |   2 +
 include/linux/virtio.h              |  22 ++
 include/linux/xbitmap.h             |  53 +++++
 include/uapi/linux/virtio_balloon.h |  11 +
 lib/radix-tree.c                    | 164 +++++++++++++-
 mm/mmzone.c                         |   2 +
 mm/page_alloc.c                     |  96 +++++++++
 10 files changed, 926 insertions(+), 67 deletions(-)
 create mode 100644 include/linux/xbitmap.h

-- 
2.7.4


* [PATCH v12 1/8] virtio-balloon: deflate via a page list
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-12 12:40 ` [PATCH v12 2/8] virtio-balloon: coding format cleanup Wei Wang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

From: Liang Li <liang.z.li@intel.com>

This patch saves the deflated pages in a list, instead of the PFN array, so
they can be released directly without a PFN-to-page conversion. Accordingly,
the balloon_pfn_to_page() function is removed.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 22caf80..7f38ae6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -104,12 +104,6 @@ static u32 page_to_balloon_pfn(struct page *page)
 	return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-	BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-	return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
 	struct virtio_balloon *vb = vq->vdev->priv;
@@ -182,18 +176,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+				 struct list_head *pages)
 {
-	unsigned int i;
-	struct page *page;
+	struct page *page, *next;
 
-	/* Find pfns pointing at start of each page, get pages and free them. */
-	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-							   vb->pfns[i]));
+	list_for_each_entry_safe(page, next, pages, lru) {
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
+		list_del(&page->lru);
 		put_page(page); /* balloon reference */
 	}
 }
@@ -203,6 +195,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	unsigned num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -216,6 +209,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -227,7 +221,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
-- 
2.7.4


* [PATCH v12 2/8] virtio-balloon: coding format cleanup
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
  2017-07-12 12:40 ` [PATCH v12 1/8] virtio-balloon: deflate via a page list Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-12 12:40 ` [PATCH v12 3/8] Introduce xbitmap Wei Wang
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

Clean up the comment format.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7f38ae6..f0b3a0b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -132,8 +132,10 @@ static void set_page_pfns(struct virtio_balloon *vb,
 {
 	unsigned int i;
 
-	/* Set balloon pfns pointing at this page.
-	 * Note that the first pfn points at start of the page. */
+	/*
+	 * Set balloon pfns pointing at this page.
+	 * Note that the first pfn points at start of the page.
+	 */
 	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
 		pfns[i] = cpu_to_virtio32(vb->vdev,
 					  page_to_balloon_pfn(page) + i);
-- 
2.7.4


* [PATCH v12 3/8] Introduce xbitmap
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
  2017-07-12 12:40 ` [PATCH v12 1/8] virtio-balloon: deflate via a page list Wei Wang
  2017-07-12 12:40 ` [PATCH v12 2/8] virtio-balloon: coding format cleanup Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-12 12:40 ` [PATCH v12 4/8] xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation that is
efficient when the set bits tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
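
For illustration only, here is a minimal usage sketch of the API as declared
in this patch (mark_pfn() is a hypothetical helper, not part of the series).
Note that xb_preload() returns with preemption disabled, and it stays
disabled until xb_preload_end():

	#include <linux/xbitmap.h>

	static DEFINE_XB(balloon_bits);

	static int mark_pfn(unsigned long pfn)
	{
		int err;

		xb_preload(GFP_KERNEL);	/* preallocate; disables preemption */
		err = xb_set_bit(&balloon_bits, pfn);
		xb_preload_end();	/* re-enables preemption */
		return err;
	}

	/* Elsewhere; test/clear do not allocate, so no preload is needed: */
	if (xb_test_bit(&balloon_bits, pfn))
		xb_clear_bit(&balloon_bits, pfn);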

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  49 ++++++++++++++++
 lib/radix-tree.c           | 138 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 187 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/xbitmap.h

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..428ccc9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -317,6 +317,8 @@ void radix_tree_iter_delete(struct radix_tree_root *,
 			struct radix_tree_iter *iter, void __rcu **slot);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void radix_tree_clear_tags(struct radix_tree_root *, struct radix_tree_node *,
 			   void __rcu **slot);
 unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..0b93a46
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,49 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+int xb_clear_bit(struct xb *xb, unsigned long bit);
+
+static inline bool xb_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..d624914 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -37,6 +37,7 @@
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
 #include <linux/string.h>
+#include <linux/xbitmap.h>
 
 
 /* Number of nodes in fully populated tree of given height */
@@ -78,6 +79,14 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xb index can reach ULONG_MAX; low bits land in a leaf ida_bitmap.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +849,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +1997,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2137,6 +2148,129 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
 }
 EXPORT_SYMBOL(ida_pre_get);
 
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+
+int xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return 0;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return 0;
+	}
+
+	if (!bitmap)
+		return 0;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+
+	return 0;
+}
+
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit >= BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+	return test_bit(bit, bitmap->bitmap);
+}
+
 void __rcu **idr_get_free(struct radix_tree_root *root,
 			struct radix_tree_iter *iter, gfp_t gfp, int end)
 {
-- 
2.7.4


* [PATCH v12 4/8] xbitmap: add xb_find_next_bit() and xb_zero()
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
                   ` (2 preceding siblings ...)
  2017-07-12 12:40 ` [PATCH v12 3/8] Introduce xbitmap Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

xb_find_next_bit() is added to support finding the next "1" or "0" bit
in a given range. xb_zero() is added to zero a given range of bits.
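
For illustration (not part of the patch), the "end + 1" return convention
lets a caller walk contiguous runs of set bits; report_runs() below is a
hypothetical function sketching that pattern:

	static void report_runs(struct xb *xb, unsigned long start,
				unsigned long end)
	{
		unsigned long run_start = start, run_end;

		while (run_start <= end) {
			run_start = xb_find_next_bit(xb, run_start, end, 1);
			if (run_start == end + 1)
				break;	/* no more set bits in the range */
			run_end = xb_find_next_bit(xb, run_start + 1, end, 0);
			/* bits [run_start, run_end) are all "1" */
			run_start = run_end + 1;
		}
	}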

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 include/linux/xbitmap.h |  4 ++++
 lib/radix-tree.c        | 26 ++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 0b93a46..88c2045 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -36,6 +36,10 @@ int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 int xb_clear_bit(struct xb *xb, unsigned long bit);
 
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end);
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set);
+
 static inline bool xb_empty(const struct xb *xb)
 {
 	return radix_tree_empty(&xb->xbrt);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index d624914..c45b910 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2271,6 +2271,32 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit)
 	return test_bit(bit, bitmap->bitmap);
 }
 
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++)
+		xb_clear_bit(xb, i);
+}
+
+/*
+ * Find the next one (@set = 1) or zero (@set = 0) bit within the bit range
+ * from @start to @end in @xb. If no such bit is found in the given range,
+ * bit end + 1 will be returned.
+ */
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++) {
+		if (xb_test_bit(xb, i) == set)
+			break;
+	}
+
+	return i;
+}
+
 void __rcu **idr_get_free(struct radix_tree_root *root,
 			struct radix_tree_iter *iter, gfp_t gfp, int end)
 {
-- 
2.7.4


* [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
                   ` (3 preceding siblings ...)
  2017-07-12 12:40 ` [PATCH v12 4/8] xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-12 13:06   ` Michael S. Tsirkin
                     ` (4 more replies)
  2017-07-12 12:40 ` [PATCH v12 6/8] mm: support reporting free page blocks Wei Wang
                   ` (3 subsequent siblings)
  8 siblings, 5 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables transferring
chunks of ballooned (i.e. inflated/deflated) pages to the host using
scatter-gather lists.

The previous virtio-balloon implementation is not very efficient,
because the balloon pages are transferred to the host one by one. Here
is the breakdown of the time, in percentage, spent on each step of the
balloon inflating process (inflating 7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are stage 2) (~2818ms) and stage
4) (~784ms).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of physically contiguous guest pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~491ms, an
improvement of ~88%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.
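
For reference, a sketch of how the chain-building API introduced by this
patch is meant to be called (vq, addr, len, token and wq stand for
caller-provided values; see tell_host_sgs() below for the real user):

	unsigned int head = VIRTQUEUE_DESC_ID_INIT;
	unsigned int prev = VIRTQUEUE_DESC_ID_INIT;

	/* one desc per physically contiguous chunk of pages */
	virtqueue_add_chain_desc(vq, addr, len, &head, &prev, 0);

	/* expose the whole chain under one token, then kick once */
	virtqueue_add_chain(vq, head, 0, NULL, token, NULL);
	virtqueue_kick_async(vq, wq);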

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 141 ++++++++++++++++++++++---
 drivers/virtio/virtio_ring.c        | 199 +++++++++++++++++++++++++++++++++---
 include/linux/virtio.h              |  20 ++++
 include/uapi/linux/virtio_balloon.h |   1 +
 4 files changed, 329 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..aa4e7ec 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +80,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record ballooned pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+/*
+ * Send balloon pages in sgs to host.
+ * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap
+ * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for
+ * contiguous "1" bits, which correspond to contiguous pages, to chunk into
+ * sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned int head_id = VIRTQUEUE_DESC_ID_INIT,
+		     prev_id = VIRTQUEUE_DESC_ID_INIT;
+	unsigned long sg_pfn_start, sg_pfn_end;
+	uint64_t sg_addr;
+	uint32_t sg_size;
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+						page_xb_end, 1);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+					      page_xb_end, 0);
+		sg_addr = sg_pfn_start << PAGE_SHIFT;
+		sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE;
+		virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id,
+					 &prev_id, 0);
+		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+
+	if (head_id != VIRTQUEUE_DESC_ID_INIT) {
+		virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL);
+		virtqueue_kick_async(vq, vb->acked);
+	}
+}
+
+/* Update pfn_max and pfn_min according to the pfn of @page */
+static inline void update_pfn_range(struct virtio_balloon *vb,
+				    struct page *page,
+				    unsigned long *pfn_min,
+				    unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+			xb_set_bit(&vb->page_xb, page_to_pfn(page));
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+			xb_set_bit(&vb->page_xb, page_to_pfn(page));
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq,
+			       struct page *page)
+{
+	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
+	u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;
+
+	virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0);
+	virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
+	virtqueue_kick_async(vq, vb->acked);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
 *			     a compaction thread.     (called under page lock)
@@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
+	xb_empty(&vb->page_xb);
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
@@ -669,6 +775,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548..b9d7e10 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
 	struct vring_desc *desc;
-	unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx;
+	unsigned int i, n, descs_used, uninitialized_var(prev), err_id;
 	int head;
 	bool indirect;
 
@@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	else
 		vq->free_head = i;
 
-	/* Store token and indirect buffer state. */
+	END_USE(vq);
+
+	return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx);
+
+unmap_release:
+	err_id = i;
+	i = head;
+
+	for (n = 0; n < total_sg; n++) {
+		if (i == err_id)
+			break;
+		vring_unmap_one(vq, &desc[i]);
+		i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
+	}
+
+	vq->vq.num_free += total_sg;
+
+	if (indirect)
+		kfree(desc);
+
+	END_USE(vq);
+	return -EIO;
+}
+
+/**
+ * virtqueue_add_chain - expose a chain of buffers to the other end
+ * @_vq: the struct virtqueue we're talking about.
+ * @head: desc id of the chain head.
+ * @indirect: set if the chain consists of indirect descs.
+ * @indir_desc: the first indirect desc.
+ * @data: the token identifying the chain.
+ * @ctx: extra context for the token.
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_chain(struct virtqueue *_vq,
+			unsigned int head,
+			bool indirect,
+			struct vring_desc *indir_desc,
+			void *data,
+			void *ctx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	unsigned int avail;
+
+	/* The desc chain is empty. */
+	if (head == VIRTQUEUE_DESC_ID_INIT)
+		return 0;
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
 	vq->desc_state[head].data = data;
 	if (indirect)
-		vq->desc_state[head].indir_desc = desc;
+		vq->desc_state[head].indir_desc = indir_desc;
 	if (ctx)
 		vq->desc_state[head].indir_desc = ctx;
 
@@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 		virtqueue_kick(_vq);
 
 	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_chain);
 
-unmap_release:
-	err_idx = i;
-	i = head;
+/**
+ * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc
+ * @vq: the struct virtqueue we're talking about.
+ * @addr: address of the buffer to add.
+ * @len: length of the buffer.
+ * @head_id: desc id of the chain head.
+ * @prev_id: desc id of the previous buffer.
+ * @in: set if the buffer is for the device to write.
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_chain_desc(struct virtqueue *_vq,
+			     uint64_t addr,
+			     uint32_t len,
+			     unsigned int *head_id,
+			     unsigned int *prev_id,
+			     bool in)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_desc *desc = vq->vring.desc;
+	uint16_t flags = in ? VRING_DESC_F_WRITE : 0;
+	unsigned int i;
 
-	for (n = 0; n < total_sg; n++) {
-		if (i == err_idx)
-			break;
-		vring_unmap_one(vq, &desc[i]);
-		i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
+	/* Sanity check */
+	if (!_vq || !head_id || !prev_id)
+		return -EINVAL;
+retry:
+	START_USE(vq);
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
 	}
 
-	vq->vq.num_free += total_sg;
+	if (vq->vq.num_free < 1) {
+		/*
+		 * There is no desc available in the vq, so kick what is
+		 * already added, and re-start building a new chain for
+		 * the passed sg.
+		 */
+		if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) {
+			END_USE(vq);
+			virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL);
+			virtqueue_kick_sync(_vq);
+			*head_id = VIRTQUEUE_DESC_ID_INIT;
+			*prev_id = VIRTQUEUE_DESC_ID_INIT;
+			goto retry;
+		} else {
+			END_USE(vq);
+			return -ENOSPC;
+		}
+	}
 
-	if (indirect)
-		kfree(desc);
+	i = vq->free_head;
+	flags &= ~VRING_DESC_F_NEXT;
+	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
+	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
+	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
+
+	/* Add the desc to the end of the chain */
+	if (*prev_id != VIRTQUEUE_DESC_ID_INIT) {
+		desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i);
+		desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev,
+							 VRING_DESC_F_NEXT);
+	}
+	*prev_id = i;
+	if (*head_id == VIRTQUEUE_DESC_ID_INIT)
+		*head_id = *prev_id;
 
+	vq->vq.num_free--;
+	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
 	END_USE(vq);
-	return -EIO;
+
+	return 0;
 }
+EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc);
 
 /**
  * virtqueue_add_sgs - expose buffers to other end
@@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_kick);
 
+/**
+ * virtqueue_kick_sync - update after add_buf and busy wait till update is done
+ * @vq: the struct virtqueue
+ *
+ * After one or more virtqueue_add_* calls, invoke this to kick
+ * the other side. Busy wait till the other side is done with the update.
+ *
+ * Caller must ensure we don't call this with other virtqueue
+ * operations at the same time (except where noted).
+ *
+ * Returns false if kick failed, otherwise true.
+ */
+bool virtqueue_kick_sync(struct virtqueue *vq)
+{
+	u32 len;
+
+	if (likely(virtqueue_kick(vq))) {
+		while (!virtqueue_get_buf(vq, &len) &&
+		       !virtqueue_is_broken(vq))
+			cpu_relax();
+		return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
+
+/**
+ * virtqueue_kick_async - update after add_buf and sleep till update is done
+ * @vq: the struct virtqueue
+ *
+ * After one or more virtqueue_add_* calls, invoke this to kick
+ * the other side, and sleep until the other side is done with the update.
+ *
+ * Caller must ensure we don't call this with other virtqueue
+ * operations at the same time (except where noted).
+ *
+ * Returns false if kick failed, otherwise true.
+ */
+bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq)
+{
+	u32 len;
+
+	if (likely(virtqueue_kick(vq))) {
+		wait_event(wq, virtqueue_get_buf(vq, &len));
+		return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_async);
+
 static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
 		       void **ctx)
 {
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 28b0e96..9f27101 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
 		      void *data,
 		      gfp_t gfp);
 
+/* A desc with this init id is treated as an invalid desc */
+#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
+int virtqueue_add_chain_desc(struct virtqueue *_vq,
+			     uint64_t addr,
+			     uint32_t len,
+			     unsigned int *head_id,
+			     unsigned int *prev_id,
+			     bool in);
+
+int virtqueue_add_chain(struct virtqueue *_vq,
+			unsigned int head,
+			bool indirect,
+			struct vring_desc *indirect_desc,
+			void *data,
+			void *ctx);
+
 bool virtqueue_kick(struct virtqueue *vq);
 
+bool virtqueue_kick_sync(struct virtqueue *vq);
+
+bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
+
 bool virtqueue_kick_prepare(struct virtqueue *vq);
 
 bool virtqueue_notify(struct virtqueue *vq);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4


* [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
                   ` (4 preceding siblings ...)
  2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-13  0:33   ` Michael S. Tsirkin
  2017-07-14 12:30   ` Michal Hocko
  2017-07-12 12:40 ` [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat Wei Wang
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

This patch adds support for reporting blocks of pages on the free list
specified by the caller.

As pages can leave the free list during this call or immediately
afterwards, they are not guaranteed to be free after the function
returns. The only guarantee this makes is that the page was on the free
list at some point in time after the function has been invoked.

Therefore, it is not safe for the caller to use any pages on the returned
block or to discard data that is put there after the function returns.
However, it is safe for the caller to discard data that was in one of these
pages before the function was invoked.
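
For illustration, a caller is expected to iterate a free list as in the
sketch below, which mirrors the use in patch 8/8 (consume_block() is a
hypothetical callback):

	struct page *page = NULL;

	while (!report_unused_page_block(zone, order, migratetype, &page)) {
		/*
		 * 2^order pages starting at page were free at some point
		 * after the call above: their old contents may be
		 * discarded, but nothing stored there afterwards may be.
		 */
		consume_block(page, order);
	}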

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 include/linux/mm.h |  5 +++
 mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..76cb433 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+extern int report_unused_page_block(struct zone *zone, unsigned int order,
+				    unsigned int migratetype,
+				    struct page **page);
+#endif
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64b7d82..8b3c9dd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+
+/*
+ * Heuristically get a page block in the system that is unused.
+ * It is possible that pages from the page block are used immediately after
+ * report_unused_page_block() returns. It is the caller's responsibility
+ * to either detect or prevent the use of such pages.
+ *
+ * The free list to check: zone->free_area[order].free_list[migratetype].
+ *
+ * If the caller supplied page block (i.e. **page) is on the free list, offer
+ * the next page block on the list to the caller. Otherwise, offer the first
+ * page block on the list.
+ *
+ * Note: it is not safe for the caller to use any pages on the returned
+ * block or to discard data that is put there after the function returns.
+ * However, it is safe for the caller to discard data that was in one of
+ * these pages before the function was invoked.
+ *
+ * Return 0 when a page block is found on the caller specified free list.
+ */
+int report_unused_page_block(struct zone *zone, unsigned int order,
+			     unsigned int migratetype, struct page **page)
+{
+	struct zone *this_zone;
+	struct list_head *this_list;
+	int ret = 0;
+	unsigned long flags;
+
+	/* Sanity check */
+	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
+	    migratetype >= MIGRATE_TYPES)
+		return -EINVAL;
+
+	/* Zone validity check */
+	for_each_populated_zone(this_zone) {
+		if (zone == this_zone)
+			break;
+	}
+
+	/* Got a non-existent zone from the caller? */
+	if (zone != this_zone)
+		return -EINVAL;
+
+	spin_lock_irqsave(&this_zone->lock, flags);
+
+	this_list = &zone->free_area[order].free_list[migratetype];
+	if (list_empty(this_list)) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/* The caller is asking for the first free page block on the list */
+	if ((*page) == NULL) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller is not on this free list
+	 * anymore (e.g. a 1MB free page block has been split). In this case,
+	 * offer the first page block on the free list that the caller is
+	 * asking for.
+	 */
+	if (PageBuddy(*page) && order != page_order(*page)) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller has been the last page block
+	 * on the list.
+	 */
+	if ((*page)->lru.next == this_list) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Finally, fall into the regular case: the page block passed from the
+	 * caller is still on the free list. Offer the next one.
+	 */
+	*page = list_next_entry((*page), lru);
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&this_zone->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(report_unused_page_block);
+
+#endif
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4


* [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
                   ` (5 preceding siblings ...)
  2017-07-12 12:40 ` [PATCH v12 6/8] mm: support reporting free page blocks Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-13  0:16   ` Michael S. Tsirkin
  2017-07-14 12:31   ` Michal Hocko
  2017-07-12 12:40 ` [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ Wei Wang
  2017-07-13  0:14 ` [PATCH v12 0/8] Virtio-balloon Enhancement Michael S. Tsirkin
  8 siblings, 2 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

This patch enables for_each_zone()/for_each_populated_zone() to be
invoked by a kernel module.
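
For example, with these symbols exported a module can do (a sketch):

	#include <linux/mmzone.h>

	struct zone *zone;

	for_each_populated_zone(zone)
		pr_info("%s: %lu managed pages\n", zone->name,
			zone->managed_pages);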

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 mm/mmzone.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmzone.c b/mm/mmzone.c
index a51c0a6..08a2a3a 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void)
 {
 	return NODE_DATA(first_online_node);
 }
+EXPORT_SYMBOL_GPL(first_online_pgdat);
 
 struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
 {
@@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone)
 	}
 	return zone;
 }
+EXPORT_SYMBOL_GPL(next_zone);
 
 static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 {
-- 
2.7.4


* [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
                   ` (6 preceding siblings ...)
  2017-07-12 12:40 ` [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat Wei Wang
@ 2017-07-12 12:40 ` Wei Wang
  2017-07-13  0:22   ` Michael S. Tsirkin
  2017-07-13  0:14 ` [PATCH v12 0/8] Virtio-balloon Enhancement Michael S. Tsirkin
  8 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-12 12:40 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, wei.w.wang, liliang.opensource
  Cc: virtio-dev, yang.zhang.wz, quan.xu

Add a new vq, cmdq, to handle requests between the device and driver.

This patch implements two commands sent from the device and handled in
the driver.
1) VIRTIO_BALLOON_CMDQ_REPORT_STATS: this command is used to report
the guest memory statistics to the host. The stats_vq mechanism is not
used when the cmdq mechanism is enabled.
2) VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: this command is used to
report the guest unused pages to the host.

Since we now have one vq handling multiple commands, we need to allow
only one vq operation at a time. Here, we change the existing START_USE()
and END_USE() macros to lock/unlock the vq on each vq operation.
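
Both commands share the cmdq header below. This is a sketch inferred from
the driver code in this patch, which treats both fields as little-endian
32-bit values; the authoritative definition is the uapi header change:

	struct virtio_balloon_cmdq_hdr {
		__le32 cmd;	/* VIRTIO_BALLOON_CMDQ_REPORT_* */
		__le32 flags;	/* e.g. VIRTIO_BALLOON_CMDQ_F_COMPLETION */
	};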

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 245 ++++++++++++++++++++++++++++++++++--
 drivers/virtio/virtio_ring.c        |  25 +++-
 include/linux/virtio.h              |   2 +
 include/uapi/linux/virtio_balloon.h |  10 ++
 4 files changed, 265 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index aa4e7ec..ae91fbf 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *cmd_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
 	struct work_struct update_balloon_size_work;
+	struct work_struct cmdq_handle_work;
 
 	/* Prevent updating balloon when it is being canceled. */
 	spinlock_t stop_update_lock;
@@ -90,6 +91,12 @@ struct virtio_balloon {
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
+	/* Cmdq msg buffer for memory statistics */
+	struct virtio_balloon_cmdq_hdr cmdq_stats_hdr;
+
+	/* Cmdq msg buffer for reporting unused pages */
+	struct virtio_balloon_cmdq_hdr cmdq_unused_page_hdr;
+
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
 };
@@ -485,25 +492,214 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static unsigned int cmdq_hdr_add(struct virtqueue *vq,
+				 struct virtio_balloon_cmdq_hdr *hdr,
+				 bool in)
+{
+	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
+	uint64_t hdr_pa = (uint64_t)virt_to_phys((void *)hdr);
+
+	virtqueue_add_chain_desc(vq, hdr_pa, sizeof(*hdr), &id, &id, in);
+
+	/* Deliver the hdr for the host to send commands. */
+	if (in) {
+		hdr->flags = 0;
+		virtqueue_add_chain(vq, id, 0, NULL, hdr, NULL);
+		virtqueue_kick(vq);
+	}
+
+	return id;
+}
+
+static void cmdq_add_chain_desc(struct virtio_balloon *vb,
+				struct virtio_balloon_cmdq_hdr *hdr,
+				uint64_t addr,
+				uint32_t len,
+				unsigned int *head_id,
+				unsigned int *prev_id)
+{
+retry:
+	if (*head_id == VIRTQUEUE_DESC_ID_INIT) {
+		*head_id = cmdq_hdr_add(vb->cmd_vq, hdr, 0);
+		*prev_id = *head_id;
+	}
+
+	virtqueue_add_chain_desc(vb->cmd_vq, addr, len, head_id, prev_id, 0);
+	if (*head_id == *prev_id) {
+		/*
+		 * The VQ was full and kicked to release some descs. Now we
+		 * will re-start to build the chain by using the hdr as the
+		 * first desc, so we need to detach the desc that was just
+		 * added, and re-start to add the hdr.
+		 */
+		virtqueue_detach_buf(vb->cmd_vq, *head_id, NULL);
+		*head_id = VIRTQUEUE_DESC_ID_INIT;
+		*prev_id = VIRTQUEUE_DESC_ID_INIT;
+		goto retry;
+	}
+}
+
+static void cmdq_handle_stats(struct virtio_balloon *vb)
+{
+	unsigned int num_stats,
+		     head_id = VIRTQUEUE_DESC_ID_INIT,
+		     prev_id = VIRTQUEUE_DESC_ID_INIT;
+	uint64_t addr = (uint64_t)virt_to_phys((void *)vb->stats);
+	uint32_t len;
+
+	spin_lock(&vb->stop_update_lock);
+	if (!vb->stop_update) {
+		num_stats = update_balloon_stats(vb);
+		len = sizeof(struct virtio_balloon_stat) * num_stats;
+		cmdq_add_chain_desc(vb, &vb->cmdq_stats_hdr, addr, len,
+				    &head_id, &prev_id);
+		virtqueue_add_chain(vb->cmd_vq, head_id, 0, NULL, vb, NULL);
+		virtqueue_kick_sync(vb->cmd_vq);
+	}
+	spin_unlock(&vb->stop_update_lock);
+}
+
+static void cmdq_add_unused_page(struct virtio_balloon *vb,
+				 struct zone *zone,
+				 unsigned int order,
+				 unsigned int type,
+				 struct page *page,
+				 unsigned int *head_id,
+				 unsigned int *prev_id)
+{
+	uint64_t addr;
+	uint32_t len;
+
+	while (!report_unused_page_block(zone, order, type, &page)) {
+		addr = (u64)page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;
+		len = (u64)(1 << order) << VIRTIO_BALLOON_PFN_SHIFT;
+		cmdq_add_chain_desc(vb, &vb->cmdq_unused_page_hdr, addr, len,
+				    head_id, prev_id);
+	}
+}
+
+static void cmdq_handle_unused_pages(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->cmd_vq;
+	unsigned int order = 0, type = 0,
+		     head_id = VIRTQUEUE_DESC_ID_INIT,
+		     prev_id = VIRTQUEUE_DESC_ID_INIT;
+	struct zone *zone = NULL;
+	struct page *page = NULL;
+
+	for_each_populated_zone(zone)
+		for_each_migratetype_order(order, type)
+			cmdq_add_unused_page(vb, zone, order, type, page,
+					     &head_id, &prev_id);
+
+	/* Set the cmd completion flag. */
+	vb->cmdq_unused_page_hdr.flags |=
+				cpu_to_le32(VIRTIO_BALLOON_CMDQ_F_COMPLETION);
+	virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL);
+	virtqueue_kick_sync(vb->cmd_vq);
+}
+
+static void cmdq_handle(struct virtio_balloon *vb)
+{
+	struct virtio_balloon_cmdq_hdr *hdr;
+	unsigned int len;
+
+	while ((hdr = (struct virtio_balloon_cmdq_hdr *)
+			virtqueue_get_buf(vb->cmd_vq, &len)) != NULL) {
+		switch (__le32_to_cpu(hdr->cmd)) {
+		case VIRTIO_BALLOON_CMDQ_REPORT_STATS:
+			cmdq_handle_stats(vb);
+			break;
+		case VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES:
+			cmdq_handle_unused_pages(vb);
+			break;
+		default:
+			dev_warn(&vb->vdev->dev, "%s: wrong cmd\n", __func__);
+			return;
+		}
+		/*
+		 * Replenish all the command buffer to the device after a
+		 * command is handled. This is for the convenience of the
+		 * device to rewind the cmdq to get back all the command
+		 * buffer after live migration.
+		 */
+		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1);
+		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1);
+	}
+}
+
+static void cmdq_handle_work_func(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon,
+			  cmdq_handle_work);
+	cmdq_handle(vb);
+}
+
+static void cmdq_callback(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	queue_work(system_freezable_wq, &vb->cmdq_handle_work);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	int err = -ENOMEM;
+	int nvqs;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ) ||
+	    virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
 
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The stats_vq is used only when cmdq is not supported (or disabled)
+	 * by the device.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
-	if (err)
-		return err;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
+		callbacks[2] = cmdq_callback;
+		names[2] = "cmdq";
+	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[2] = stats_request;
+		names[2] = "stats";
+	}
 
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
+					 names, NULL, NULL);
+	if (err)
+		goto err_find;
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
-	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
+		vb->cmd_vq = vqs[2];
+		/* Prime the cmdq with the header buffer. */
+		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1);
+		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1);
+	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
 		unsigned int num_stats;
 		vb->stats_vq = vqs[2];
@@ -520,6 +716,16 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
+
 	return 0;
 }
 
@@ -640,7 +846,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		goto out;
 	}
 
-	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
+		vb->cmdq_stats_hdr.cmd =
+				cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_STATS);
+		vb->cmdq_stats_hdr.flags = 0;
+		vb->cmdq_unused_page_hdr.cmd =
+			cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES);
+		vb->cmdq_unused_page_hdr.flags = 0;
+		INIT_WORK(&vb->cmdq_handle_work, cmdq_handle_work_func);
+	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		INIT_WORK(&vb->update_balloon_stats_work,
+			  update_balloon_stats_func);
+	}
 	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
@@ -722,6 +939,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->cmdq_handle_work);
 
 	xb_empty(&vb->page_xb);
 	remove_common(vb);
@@ -776,6 +994,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CMD_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index b9d7e10..793de12 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -52,8 +52,13 @@
 			"%s:"fmt, (_vq)->vq.name, ##args);	\
 		(_vq)->broken = true;				\
 	} while (0)
-#define START_USE(vq)
-#define END_USE(vq)
+#define START_USE(_vq)						\
+	do {							\
+		while ((_vq)->in_use)				\
+			cpu_relax();				\
+		(_vq)->in_use = __LINE__;			\
+	} while (0)
+#define END_USE(_vq)	((_vq)->in_use = 0)
 #endif
 
 struct vring_desc_state {
@@ -101,9 +106,9 @@ struct vring_virtqueue {
 	size_t queue_size_in_bytes;
 	dma_addr_t queue_dma_addr;
 
-#ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
+#ifdef DEBUG
 
 	/* Figure out if their kicks are too delayed. */
 	bool last_add_time_valid;
@@ -845,6 +850,18 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
 	}
 }
 
+void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	START_USE(vq);
+
+	detach_buf(vq, head, ctx);
+
+	END_USE(vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_detach_buf);
+
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
 	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
@@ -1158,8 +1175,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 	vq->avail_idx_shadow = 0;
 	vq->num_added = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
+	vq->in_use = 0;
 #ifdef DEBUG
-	vq->in_use = false;
 	vq->last_add_time_valid = false;
 #endif
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 9f27101..9df480b 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -88,6 +88,8 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 void *virtqueue_get_buf_ctx(struct virtqueue *vq, unsigned int *len,
 			    void **ctx);
 
+void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx);
+
 void virtqueue_disable_cb(struct virtqueue *vq);
 
 bool virtqueue_enable_cb(struct virtqueue *vq);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..b38c370 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CMD_VQ		4 /* Command virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,13 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+struct virtio_balloon_cmdq_hdr {
+#define VIRTIO_BALLOON_CMDQ_REPORT_STATS	0
+#define VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES	1
+	__le32 cmd;
+/* Flag to indicate the completion of handling a command */
+#define VIRTIO_BALLOON_CMDQ_F_COMPLETION	1
+	__le32 flags;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-07-12 13:06   ` Michael S. Tsirkin
  2017-07-12 13:29     ` Wei Wang
  2017-07-13  0:44   ` Michael S. Tsirkin
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-12 13:06 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 28b0e96..9f27101 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>  		      void *data,
>  		      gfp_t gfp);
>  
> +/* A desc with this init id is treated as an invalid desc */
> +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
> +int virtqueue_add_chain_desc(struct virtqueue *_vq,
> +			     uint64_t addr,
> +			     uint32_t len,
> +			     unsigned int *head_id,
> +			     unsigned int *prev_id,
> +			     bool in);
> +
> +int virtqueue_add_chain(struct virtqueue *_vq,
> +			unsigned int head,
> +			bool indirect,
> +			struct vring_desc *indirect_desc,
> +			void *data,
> +			void *ctx);
> +
>  bool virtqueue_kick(struct virtqueue *vq);
>  
> +bool virtqueue_kick_sync(struct virtqueue *vq);
> +
> +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
> +
>  bool virtqueue_kick_prepare(struct virtqueue *vq);
>  
>  bool virtqueue_notify(struct virtqueue *vq);

I don't much care for this API. It does exactly what balloon needs,
but at the cost of, e.g., transparently busy-waiting. Unlikely to be
a good fit for anything else.

If you don't like my original _first/_next/_last, you will
need to come up with something else.
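
For reference, the wait being objected to is the path the v12 code takes
when the ring fills up - condensed from the patch below, not a literal
quote:

	/* virtqueue_add_chain_desc(), on a full ring */
	if (vq->vq.num_free < 1) {
		if (*head_id != VIRTQUEUE_DESC_ID_INIT) {
			/* close and submit the chain built so far ... */
			virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL);
			/* ... then spin with cpu_relax() until the host
			 * hands the buffer back */
			virtqueue_kick_sync(_vq);
			goto retry;
		}
		return -ENOSPC;
	}

The caller never sees that wait happen.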

-- 
MST

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 13:06   ` Michael S. Tsirkin
@ 2017-07-12 13:29     ` Wei Wang
  2017-07-12 13:56       ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-12 13:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote:
> On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:
>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>> index 28b0e96..9f27101 100644
>> --- a/include/linux/virtio.h
>> +++ b/include/linux/virtio.h
>> @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>>   		      void *data,
>>   		      gfp_t gfp);
>>   
>> +/* A desc with this init id is treated as an invalid desc */
>> +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
>> +int virtqueue_add_chain_desc(struct virtqueue *_vq,
>> +			     uint64_t addr,
>> +			     uint32_t len,
>> +			     unsigned int *head_id,
>> +			     unsigned int *prev_id,
>> +			     bool in);
>> +
>> +int virtqueue_add_chain(struct virtqueue *_vq,
>> +			unsigned int head,
>> +			bool indirect,
>> +			struct vring_desc *indirect_desc,
>> +			void *data,
>> +			void *ctx);
>> +
>>   bool virtqueue_kick(struct virtqueue *vq);
>>   
>> +bool virtqueue_kick_sync(struct virtqueue *vq);
>> +
>> +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
>> +
>>   bool virtqueue_kick_prepare(struct virtqueue *vq);
>>   
>>   bool virtqueue_notify(struct virtqueue *vq);
> I don't much care for this API. It does exactly what balloon needs,
> but at the cost of, e.g., transparently busy-waiting. Unlikely to be
> a good fit for anything else.

If you were referring to this API - virtqueue_add_chain_desc():

Busy waiting only happens when the vq is full (i.e. no desc left). If
necessary, I think we can add an input parameter like
"bool busywaiting", so the caller can decide whether to simply get back
-ENOSPC or to busy wait until a desc becomes available.
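
A minimal sketch of that idea (illustrative only, this parameter is not
in the posted patch):

	/*
	 * busywaiting = false: return -ENOSPC when no desc is left;
	 * busywaiting = true: kick what has been added and wait for
	 * descs to be freed, as the v12 code does today.
	 */
	int virtqueue_add_chain_desc(struct virtqueue *_vq,
				     uint64_t addr,
				     uint32_t len,
				     unsigned int *head_id,
				     unsigned int *prev_id,
				     bool in,
				     bool busywaiting);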

>
> If you don't like my original _first/_next/_last, you will
> need to come up with something else.

I thought the above virtqueue_add_chain_desc() provides the same
functionality as _first/next/last, which are used to grab descs from the
vq and chain them together. If not, could you please elaborate on the
usage of the original proposal?

Best,
Wei



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 13:29     ` Wei Wang
@ 2017-07-12 13:56       ` Michael S. Tsirkin
  2017-07-13  7:42         ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-12 13:56 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 12, 2017 at 09:29:00PM +0800, Wei Wang wrote:
> On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:
> > > diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> > > index 28b0e96..9f27101 100644
> > > --- a/include/linux/virtio.h
> > > +++ b/include/linux/virtio.h
> > > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
> > >   		      void *data,
> > >   		      gfp_t gfp);
> > > +/* A desc with this init id is treated as an invalid desc */
> > > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
> > > +int virtqueue_add_chain_desc(struct virtqueue *_vq,
> > > +			     uint64_t addr,
> > > +			     uint32_t len,
> > > +			     unsigned int *head_id,
> > > +			     unsigned int *prev_id,
> > > +			     bool in);
> > > +
> > > +int virtqueue_add_chain(struct virtqueue *_vq,
> > > +			unsigned int head,
> > > +			bool indirect,
> > > +			struct vring_desc *indirect_desc,
> > > +			void *data,
> > > +			void *ctx);
> > > +
> > >   bool virtqueue_kick(struct virtqueue *vq);
> > > +bool virtqueue_kick_sync(struct virtqueue *vq);
> > > +
> > > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
> > > +
> > >   bool virtqueue_kick_prepare(struct virtqueue *vq);
> > >   bool virtqueue_notify(struct virtqueue *vq);
> > I don't much care for this API. It does exactly what balloon needs,
> > but at the cost of, e.g., transparently busy-waiting. Unlikely to be
> > a good fit for anything else.
> 
> If you were referring to this API - virtqueue_add_chain_desc():
> 
> Busy waiting only happens when the vq is full (i.e. no desc left). If
> necessary, I think we can add an input parameter like
> "bool busywaiting", so the caller can decide whether to simply get back
> -ENOSPC or to busy wait until a desc becomes available.

I think this just shows this API is too high level.
This policy should live in drivers.

> > 
> > If you don't like my original _first/_next/_last, you will
> > need to come up with something else.
> 
> I thought the above virtqueue_add_chain_desc() provides the same
> functionality as _first/next/last, which are used to grab descs from the
> vq and chain them together. If not, could you please elaborate on the
> usage of the original proposal?
> 
> Best,
> Wei
> 

So the way I see it, there are several issues:

- internal wait - forces multiple APIs like kick/kick_sync
  note how kick_sync can fail but your code never checks the return code
- need to re-write the last descriptor - might not work
  for alternative layouts which always expose descriptors
  immediately
- some kind of iterator type would be nicer instead of
  maintaining head/prev explicitly
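
(For the last point, the iterator could be as small as, say:

	struct vring_desc_iter {
		unsigned int head_id;	/* first desc of the chain */
		unsigned int prev_id;	/* last desc added so far */
	};

so callers pass one cursor around instead of two ids - a sketch, not a
proposal for the exact type.)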


As for the use, it would be better to do

if (!add_next(vq, ...)) {
	add_last(vq, ...)
	kick
	wait
}

Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but
in fact it merely puts the branch in the virtio code.
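
Spelled out slightly more (a sketch only - add_next()/add_last() are
the names from my earlier proposal, not existing APIs, and the wait
mechanism is whatever the driver picks):

	unsigned int rlen;

	while (have_more_bufs()) {		/* driver-specific */
		if (!add_next(vq, addr, len)) {
			/* ring full: make this buffer the last in the
			 * chain, notify the host, wait for completion */
			add_last(vq, addr, len);
			virtqueue_kick(vq);
			wait_event(vb->acked, virtqueue_get_buf(vq, &rlen));
		}
	}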



-- 
MST

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 0/8] Virtio-balloon Enhancement
  2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
                   ` (7 preceding siblings ...)
  2017-07-12 12:40 ` [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ Wei Wang
@ 2017-07-13  0:14 ` Michael S. Tsirkin
  8 siblings, 0 replies; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-13  0:14 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 12, 2017 at 08:40:13PM +0800, Wei Wang wrote:
> This patch series enhances the existing virtio-balloon with the following new
> features:
> 1) fast ballooning: transfer ballooned pages between the guest and host in
> chunks using sgs, instead of one by one; and
> 2) cmdq: a new virtqueue to send commands between the device and driver.
> Currently, it supports commands to report memory stats (replace the old statq
> mechanism) and report guest unused pages.

Could we get some feedback from the mm crowd on patches 6 and 7?

> Change Log:
> 
> v11->v12:
> 1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
> 2) virtio-ring: enable the driver to build up a desc chain using vring desc.
> 3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro
> to lock/unlock the vq when a vq operation starts/ends.
> 4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
> 5) virtio-balloon: describe chunks of ballooned pages and free pages blocks
> directly using one or more chains of desc from the vq.
> 
> v10->v11:
> 1) virtio_balloon: use vring_desc to describe a chunk;
> 2) virtio_ring: support to add an indirect desc table to virtqueue;
> 3)  virtio_balloon: use cmdq to report guest memory statistics.
> 
> v9->v10:
> 1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
> 2) virtio-balloon: add virtballoon_validate();
> 3) virtio-balloon: msg format change;
> 4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
> 5) virtio-balloon: code cleanup.
> 
> v8->v9:
> 1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
> VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
> implementation;
> 2) Simpler function to get the free page block.
> 
> v7->v8:
> 1) Use only one chunk format, instead of two.
> 2) re-write the virtio-balloon implementation patch.
> 3) commit changes
> 4) patch re-org
> 
> Liang Li (1):
>   virtio-balloon: deflate via a page list
> 
> Matthew Wilcox (1):
>   Introduce xbitmap
> 
> Wei Wang (6):
>   virtio-balloon: coding format cleanup
>   xbitmap: add xb_find_next_bit() and xb_zero()
>   virtio-balloon: VIRTIO_BALLOON_F_SG
>   mm: support reporting free page blocks
>   mm: export symbol of next_zone and first_online_pgdat
>   virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
> 
>  drivers/virtio/virtio_balloon.c     | 414 ++++++++++++++++++++++++++++++++----
>  drivers/virtio/virtio_ring.c        | 224 +++++++++++++++++--
>  include/linux/mm.h                  |   5 +
>  include/linux/radix-tree.h          |   2 +
>  include/linux/virtio.h              |  22 ++
>  include/linux/xbitmap.h             |  53 +++++
>  include/uapi/linux/virtio_balloon.h |  11 +
>  lib/radix-tree.c                    | 164 +++++++++++++-
>  mm/mmzone.c                         |   2 +
>  mm/page_alloc.c                     |  96 +++++++++
>  10 files changed, 926 insertions(+), 67 deletions(-)
>  create mode 100644 include/linux/xbitmap.h
> 
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat
  2017-07-12 12:40 ` [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat Wei Wang
@ 2017-07-13  0:16   ` Michael S. Tsirkin
  2017-07-13  8:41     ` [virtio-dev] " Wei Wang
  2017-07-14 12:31   ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-13  0:16 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 12, 2017 at 08:40:20PM +0800, Wei Wang wrote:
> This patch enables for_each_zone()/for_each_populated_zone() to be
> invoked by a kernel module.

... for use by virtio balloon.

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>

balloon seems to only use
+       for_each_populated_zone(zone)
+               for_each_migratetype_order(order, type)


> ---
>  mm/mmzone.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index a51c0a6..08a2a3a 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void)
>  {
>  	return NODE_DATA(first_online_node);
>  }
> +EXPORT_SYMBOL_GPL(first_online_pgdat);
>  
>  struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
>  {
> @@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone)
>  	}
>  	return zone;
>  }
> +EXPORT_SYMBOL_GPL(next_zone);
>  
>  static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
>  {
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-07-12 12:40 ` [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ Wei Wang
@ 2017-07-13  0:22   ` Michael S. Tsirkin
  2017-07-13  8:46     ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-13  0:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 12, 2017 at 08:40:21PM +0800, Wei Wang wrote:
> Add a new vq, cmdq, to handle requests between the device and driver.
> 
> This patch implements two commands sent from the device and handled in
> the driver.
> 1) VIRTIO_BALLOON_CMDQ_REPORT_STATS: this command is used to report
> the guest memory statistics to the host. The stats_vq mechanism is not
> used when the cmdq mechanism is enabled.
> 2) VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: this command is used to
> report the guest unused pages to the host.
> 
> Since we now have a vq to handle multiple commands, we need to keep only
> one vq operation at a time. Here, we change the existing START_USE()
> and END_USE() to lock on each vq operation.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 245 ++++++++++++++++++++++++++++++++++--
>  drivers/virtio/virtio_ring.c        |  25 +++-
>  include/linux/virtio.h              |   2 +
>  include/uapi/linux/virtio_balloon.h |  10 ++
>  4 files changed, 265 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index aa4e7ec..ae91fbf 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *cmd_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct cmdq_handle_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,12 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/* Cmdq msg buffer for memory statistics */
> +	struct virtio_balloon_cmdq_hdr cmdq_stats_hdr;
> +
> +	/* Cmdq msg buffer for reporting unused pages */
> +	struct virtio_balloon_cmdq_hdr cmdq_unused_page_hdr;
> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -485,25 +492,214 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static unsigned int cmdq_hdr_add(struct virtqueue *vq,
> +				 struct virtio_balloon_cmdq_hdr *hdr,
> +				 bool in)
> +{
> +	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
> +	uint64_t hdr_pa = (uint64_t)virt_to_phys((void *)hdr);
> +
> +	virtqueue_add_chain_desc(vq, hdr_pa, sizeof(*hdr), &id, &id, in);
> +
> +	/* Deliver the hdr for the host to send commands. */
> +	if (in) {
> +		hdr->flags = 0;
> +		virtqueue_add_chain(vq, id, 0, NULL, hdr, NULL);
> +		virtqueue_kick(vq);
> +	}
> +
> +	return id;
> +}
> +
> +static void cmdq_add_chain_desc(struct virtio_balloon *vb,
> +				struct virtio_balloon_cmdq_hdr *hdr,
> +				uint64_t addr,
> +				uint32_t len,
> +				unsigned int *head_id,
> +				unsigned int *prev_id)
> +{
> +retry:
> +	if (*head_id == VIRTQUEUE_DESC_ID_INIT) {
> +		*head_id = cmdq_hdr_add(vb->cmd_vq, hdr, 0);
> +		*prev_id = *head_id;
> +	}
> +
> +	virtqueue_add_chain_desc(vb->cmd_vq, addr, len, head_id, prev_id, 0);
> +	if (*head_id == *prev_id) {

That's an ugly way to detect ring full.
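
Compare, e.g., letting the add path fail with -ENOSPC and branching on
the return value instead of comparing ids (a sketch; assumes the
internal kick is taken out of virtqueue_add_chain_desc()):

	ret = virtqueue_add_chain_desc(vb->cmd_vq, addr, len,
				       head_id, prev_id, 0);
	if (ret == -ENOSPC) {
		/* submit what we have, then rebuild from the hdr */
		virtqueue_add_chain(vb->cmd_vq, *head_id, 0, NULL, vb, NULL);
		virtqueue_kick_sync(vb->cmd_vq);
		*head_id = VIRTQUEUE_DESC_ID_INIT;
		*prev_id = VIRTQUEUE_DESC_ID_INIT;
		goto retry;
	}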

> +		/*
> +		 * The VQ was full and kicked to release some descs. Now we
> +		 * will re-start to build the chain by using the hdr as the
> +		 * first desc, so we need to detach the desc that was just
> +		 * added, and re-start to add the hdr.
> +		 */
> +		virtqueue_detach_buf(vb->cmd_vq, *head_id, NULL);
> +		*head_id = VIRTQUEUE_DESC_ID_INIT;
> +		*prev_id = VIRTQUEUE_DESC_ID_INIT;
> +		goto retry;
> +	}
> +}
> +
> +static void cmdq_handle_stats(struct virtio_balloon *vb)
> +{
> +	unsigned int num_stats,
> +		     head_id = VIRTQUEUE_DESC_ID_INIT,
> +		     prev_id = VIRTQUEUE_DESC_ID_INIT;
> +	uint64_t addr = (uint64_t)virt_to_phys((void *)vb->stats);
> +	uint32_t len;
> +
> +	spin_lock(&vb->stop_update_lock);
> +	if (!vb->stop_update) {
> +		num_stats = update_balloon_stats(vb);
> +		len = sizeof(struct virtio_balloon_stat) * num_stats;
> +		cmdq_add_chain_desc(vb, &vb->cmdq_stats_hdr, addr, len,
> +				    &head_id, &prev_id);
> +		virtqueue_add_chain(vb->cmd_vq, head_id, 0, NULL, vb, NULL);
> +		virtqueue_kick_sync(vb->cmd_vq);
> +	}
> +	spin_unlock(&vb->stop_update_lock);
> +}
> +
> +static void cmdq_add_unused_page(struct virtio_balloon *vb,
> +				 struct zone *zone,
> +				 unsigned int order,
> +				 unsigned int type,
> +				 struct page *page,
> +				 unsigned int *head_id,
> +				 unsigned int *prev_id)
> +{
> +	uint64_t addr;
> +	uint32_t len;
> +
> +	while (!report_unused_page_block(zone, order, type, &page)) {
> +		addr = (u64)page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;
> +		len = (u64)(1 << order) << VIRTIO_BALLOON_PFN_SHIFT;
> +		cmdq_add_chain_desc(vb, &vb->cmdq_unused_page_hdr, addr, len,
> +				    head_id, prev_id);
> +	}
> +}
> +
> +static void cmdq_handle_unused_pages(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->cmd_vq;
> +	unsigned int order = 0, type = 0,
> +		     head_id = VIRTQUEUE_DESC_ID_INIT,
> +		     prev_id = VIRTQUEUE_DESC_ID_INIT;
> +	struct zone *zone = NULL;
> +	struct page *page = NULL;
> +
> +	for_each_populated_zone(zone)
> +		for_each_migratetype_order(order, type)
> +			cmdq_add_unused_page(vb, zone, order, type, page,
> +					     &head_id, &prev_id);
> +
> +	/* Set the cmd completion flag. */
> +	vb->cmdq_unused_page_hdr.flags |=
> +				cpu_to_le32(VIRTIO_BALLOON_CMDQ_F_COMPLETION);
> +	virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL);
> +	virtqueue_kick_sync(vb->cmd_vq);
> +}
> +
> +static void cmdq_handle(struct virtio_balloon *vb)
> +{
> +	struct virtio_balloon_cmdq_hdr *hdr;
> +	unsigned int len;
> +
> +	while ((hdr = (struct virtio_balloon_cmdq_hdr *)
> +			virtqueue_get_buf(vb->cmd_vq, &len)) != NULL) {
> +		switch (__le32_to_cpu(hdr->cmd)) {
> +		case VIRTIO_BALLOON_CMDQ_REPORT_STATS:
> +			cmdq_handle_stats(vb);
> +			break;
> +		case VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES:
> +			cmdq_handle_unused_pages(vb);
> +			break;
> +		default:
> +			dev_warn(&vb->vdev->dev, "%s: wrong cmd\n", __func__);
> +			return;
> +		}
> +		/*
> +		 * Replenish all the command buffers to the device after a
> +		 * command is handled. This makes it easy for the device to
> +		 * rewind the cmdq and reclaim all the command buffers after
> +		 * live migration.
> +		 */
> +		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1);
> +		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1);
> +	}
> +}
> +
> +static void cmdq_handle_work_func(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon,
> +			  cmdq_handle_work);
> +	cmdq_handle(vb);
> +}
> +
> +static void cmdq_callback(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->cmdq_handle_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	int err = -ENOMEM;
> +	int nvqs;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ) ||
> +	    virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
>  
>  	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * The stats_vq is used only when cmdq is not supported (or disabled)
> +	 * by the device.
>  	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> -	if (err)
> -		return err;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
> +		callbacks[2] = cmdq_callback;
> +		names[2] = "cmdq";
> +	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[2] = stats_request;
> +		names[2] = "stats";
> +	}
>  
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
> +					 names, NULL, NULL);
> +	if (err)
> +		goto err_find;
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> -	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
> +		vb->cmd_vq = vqs[2];
> +		/* Prime the cmdq with the header buffer. */
> +		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_stats_hdr, 1);
> +		cmdq_hdr_add(vb->cmd_vq, &vb->cmdq_unused_page_hdr, 1);
> +	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>  		struct scatterlist sg;
>  		unsigned int num_stats;
>  		vb->stats_vq = vqs[2];
> @@ -520,6 +716,16 @@ static int init_vqs(struct virtio_balloon *vb)
>  			BUG();
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
> +
>  	return 0;
>  }
>  
> @@ -640,7 +846,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		goto out;
>  	}
>  
> -	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
> +		vb->cmdq_stats_hdr.cmd =
> +				cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_STATS);
> +		vb->cmdq_stats_hdr.flags = 0;
> +		vb->cmdq_unused_page_hdr.cmd =
> +			cpu_to_le32(VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES);
> +		vb->cmdq_unused_page_hdr.flags = 0;
> +		INIT_WORK(&vb->cmdq_handle_work, cmdq_handle_work_func);
> +	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		INIT_WORK(&vb->update_balloon_stats_work,
> +			  update_balloon_stats_func);
> +	}
>  	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
> @@ -722,6 +939,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->cmdq_handle_work);
>  
>  	xb_empty(&vb->page_xb);
>  	remove_common(vb);
> @@ -776,6 +994,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_CMD_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index b9d7e10..793de12 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -52,8 +52,13 @@
>  			"%s:"fmt, (_vq)->vq.name, ##args);	\
>  		(_vq)->broken = true;				\
>  	} while (0)
> -#define START_USE(vq)
> -#define END_USE(vq)
> +#define START_USE(_vq)						\
> +	do {							\
> +		while ((_vq)->in_use)				\
> +			cpu_relax();				\
> +		(_vq)->in_use = __LINE__;			\
> +	} while (0)
> +#define END_USE(_vq)	((_vq)->in_use = 0)
>  #endif
>  
>  struct vring_desc_state {
> @@ -101,9 +106,9 @@ struct vring_virtqueue {
>  	size_t queue_size_in_bytes;
>  	dma_addr_t queue_dma_addr;
>  
> -#ifdef DEBUG
>  	/* They're supposed to lock for us. */
>  	unsigned int in_use;
> +#ifdef DEBUG
>  
>  	/* Figure out if their kicks are too delayed. */
>  	bool last_add_time_valid;
> @@ -845,6 +850,18 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
>  	}
>  }
>  
> +void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +
> +	START_USE(vq);
> +
> +	detach_buf(vq, head, ctx);
> +
> +	END_USE(vq);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_detach_buf);
> +
>  static inline bool more_used(const struct vring_virtqueue *vq)
>  {
>  	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> @@ -1158,8 +1175,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
>  	vq->avail_idx_shadow = 0;
>  	vq->num_added = 0;
>  	list_add_tail(&vq->vq.list, &vdev->vqs);
> +	vq->in_use = 0;
>  #ifdef DEBUG
> -	vq->in_use = false;
>  	vq->last_add_time_valid = false;
>  #endif
>  
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 9f27101..9df480b 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -88,6 +88,8 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
>  void *virtqueue_get_buf_ctx(struct virtqueue *vq, unsigned int *len,
>  			    void **ctx);
>  
> +void virtqueue_detach_buf(struct virtqueue *_vq, unsigned int head, void **ctx);
> +
>  void virtqueue_disable_cb(struct virtqueue *vq);
>  
>  bool virtqueue_enable_cb(struct virtqueue *vq);
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..b38c370 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_CMD_VQ		4 /* Command virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -83,4 +84,13 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +struct virtio_balloon_cmdq_hdr {
> +#define VIRTIO_BALLOON_CMDQ_REPORT_STATS	0
> +#define VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES	1
> +	__le32 cmd;
> +/* Flag to indicate the completion of handling a command */
> +#define VIRTIO_BALLOON_CMDQ_F_COMPLETION	1
> +	__le32 flags;
> +};
> +
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-12 12:40 ` [PATCH v12 6/8] mm: support reporting free page blocks Wei Wang
@ 2017-07-13  0:33   ` Michael S. Tsirkin
  2017-07-13  8:25     ` Wei Wang
  2017-07-14 12:30   ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-13  0:33 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 12, 2017 at 08:40:19PM +0800, Wei Wang wrote:
> This patch adds support for reporting blocks of pages on the free list
> specified by the caller.
> 
> As pages can leave the free list during this call or immediately
> afterwards, they are not guaranteed to be free after the function
> returns. The only guarantee this makes is that the page was on the free
> list at some point in time after the function has been invoked.
> 
> Therefore, it is not safe for the caller to use any pages on the returned
> block or to discard data that is put there after the function returns.
> However, it is safe for the caller to discard data that was in one of these
> pages before the function was invoked.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  include/linux/mm.h |  5 +++
>  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..76cb433 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> +				    unsigned int migratetype,
> +				    struct page **page);
> +#endif
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64b7d82..8b3c9dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +
> +/*
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * report_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller-supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Note: it is not safe for the caller to use any pages on the returned
> + * block or to discard data that is put there after the function returns.
> + * However, it is safe for the caller to discard data that was in one of these
> + * pages before the function was invoked.
> + *
> + * Return 0 when a page block is found on the caller specified free list.

Otherwise?

> + */

As an alternative, we could have an API that scans free pages
and invokes a callback under a lock. Granted, this might end up
spending a lot of time under the lock. Is that a big issue?
Some benchmarking will tell.

It would then be up to the hypervisor to decide whether it wants to play
tricks with the dirty bit or just wants to drop pages while the VCPU is
stopped.
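
To make that concrete, such an interface might look like the sketch
below (walk_free_page_blocks() is a made-up name here, not an existing
API; the callback must not sleep since it runs under zone->lock with
IRQs off):

	static int walk_free_page_blocks(struct zone *zone,
					 unsigned int order,
					 unsigned int migratetype,
					 int (*report)(unsigned long pfn,
						       unsigned int order,
						       void *opaque),
					 void *opaque)
	{
		struct list_head *list =
			&zone->free_area[order].free_list[migratetype];
		struct page *page;
		unsigned long flags;
		int ret = 0;

		spin_lock_irqsave(&zone->lock, flags);
		/* every page on this list heads a free block of
		 * 1 << order pages */
		list_for_each_entry(page, list, lru) {
			ret = report(page_to_pfn(page), order, opaque);
			if (ret)
				break;
		}
		spin_unlock_irqrestore(&zone->lock, flags);
		return ret;
	}

Whether holding zone->lock across the whole walk is acceptable is
exactly the benchmarking question above.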


> +int report_unused_page_block(struct zone *zone, unsigned int order,
> +			     unsigned int migratetype, struct page **page)
> +{
> +	struct zone *this_zone;
> +	struct list_head *this_list;
> +	int ret = 0;
> +	unsigned long flags;
> +
> +	/* Sanity check */
> +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> +	    migratetype >= MIGRATE_TYPES)
> +		return -EINVAL;

Why would callers do this?

> +
> +	/* Zone validity check */
> +	for_each_populated_zone(this_zone) {
> +		if (zone == this_zone)
> +			break;
> +	}

Why?  This will take a long time if there are lots of zones.

> +
> +	/* Got a non-existent zone from the caller? */
> +	if (zone != this_zone)
> +		return -EINVAL;

When does this happen?

> +
> +	spin_lock_irqsave(&this_zone->lock, flags);
> +
> +	this_list = &zone->free_area[order].free_list[migratetype];
> +	if (list_empty(this_list)) {
> +		*page = NULL;
> +		ret = 1;


What does this mean?

> +		goto out;
> +	}
> +
> +	/* The caller is asking for the first free page block on the list */
> +	if ((*page) == NULL) {

if (!*page) is shorter and prettier.

> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller is not on this free list
> +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> +	 * offer the first page block on the free list that the caller is
> +	 * asking for.

This just might keep giving you the same block over and over again.
E.g.
	- get 1st block
	- get 2nd block
	- 2nd gets broken up
	- get 1st block again

This way we might never make progress beyond the first 2 blocks.


> +	 */
> +	if (PageBuddy(*page) && order != page_order(*page)) {
> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller has been the last page block
> +	 * on the list.
> +	 */
> +	if ((*page)->lru.next == this_list) {
> +		*page = NULL;
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Finally, fall into the regular case: the page block passed from the
> +	 * caller is still on the free list. Offer the next one.
> +	 */
> +	*page = list_next_entry((*page), lru);
> +	ret = 0;
> +out:
> +	spin_unlock_irqrestore(&this_zone->lock, flags);
> +	return ret;
> +}
> +EXPORT_SYMBOL(report_unused_page_block);
> +
> +#endif
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
  2017-07-12 13:06   ` Michael S. Tsirkin
@ 2017-07-13  0:44   ` Michael S. Tsirkin
  2017-07-13  1:16   ` kbuild test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-13  0:44 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables transferring
> chunks of ballooned (i.e. inflated/deflated) pages to the host using
> scatter-gather lists.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes stage 2) by transferring pages to the host in
> sgs. An sg describes a chunk of physically contiguous guest pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~491ms
> resulting in an improvement of ~88%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 141 ++++++++++++++++++++++---
>  drivers/virtio/virtio_ring.c        | 199 +++++++++++++++++++++++++++++++++---
>  include/linux/virtio.h              |  20 ++++
>  include/uapi/linux/virtio_balloon.h |   1 +
>  4 files changed, 329 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..aa4e7ec 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +80,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record ballooned pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +/*
> + * Send balloon pages in sgs to host.
> + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap
> + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for
> + * continuous "1" bits, which correspond to continuous pages, to chunk into
> + * sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be serached.

searched

> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned int head_id = VIRTQUEUE_DESC_ID_INIT,
> +		     prev_id = VIRTQUEUE_DESC_ID_INIT;
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	uint64_t sg_addr;
> +	uint32_t sg_size;
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> +						page_xb_end, 1);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +					      page_xb_end, 0);
> +		sg_addr = sg_pfn_start << PAGE_SHIFT;
> +		sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE;

There's an issue here - this might not fit in uint32_t.
You need to limit sg_pfn_end - something like:

	/* make sure sg_size below fits in a 32 bit integer */
	sg_pfn_end = min(sg_pfn_end, sg_pfn_start + (UINT_MAX >> PAGE_SHIFT));

(With 4 KiB pages that caps one sg at just under 4 GiB, the most a
32 bit sg_size can describe.)

> +		virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id,
> +					 &prev_id, 0);
> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +
> +	if (head_id != VIRTQUEUE_DESC_ID_INIT) {
> +		virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL);
> +		virtqueue_kick_async(vq, vb->acked);
> +	}
> +}
> +
> +/* Update pfn_max and pfn_min according to the pfn of @page */
> +static inline void update_pfn_range(struct virtio_balloon *vb,
> +				    struct page *page,
> +				    unsigned long *pfn_min,
> +				    unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg) {
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +			xb_set_bit(&vb->page_xb, page_to_pfn(page));
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg) {
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +			xb_set_bit(&vb->page_xb, page_to_pfn(page));
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq,
> +			       struct page *page)
> +{
> +	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
> +	u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;
> +
> +	virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0);
> +	virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
> +	virtqueue_kick_async(vq, vb->acked);
> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
> +	xb_empty(&vb->page_xb);
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	if (vb->vb_dev_info.inode)
> @@ -669,6 +775,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 5e1b548..b9d7e10 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>  	struct scatterlist *sg;
>  	struct vring_desc *desc;
> -	unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx;
> +	unsigned int i, n, descs_used, uninitialized_var(prev), err_id;
>  	int head;
>  	bool indirect;
>  
> @@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	else
>  		vq->free_head = i;
>  
> -	/* Store token and indirect buffer state. */
> +	END_USE(vq);
> +
> +	return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx);
> +
> +unmap_release:
> +	err_id = i;
> +	i = head;
> +
> +	for (n = 0; n < total_sg; n++) {
> +		if (i == err_id)
> +			break;
> +		vring_unmap_one(vq, &desc[i]);
> +		i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
> +	}
> +
> +	vq->vq.num_free += total_sg;
> +
> +	if (indirect)
> +		kfree(desc);
> +
> +	END_USE(vq);
> +	return -EIO;
> +}
> +
> +/**
> + * virtqueue_add_chain - expose a chain of buffers to the other end
> + * @_vq: the struct virtqueue we're talking about.
> + * @head: desc id of the chain head.
> + * @indirect: set if the chain of descs are indrect descs.
> + * @indir_desc: the first indirect desc.
> + * @data: the token identifying the chain.
> + * @ctx: extra context for the token.
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_chain(struct virtqueue *_vq,
> +			unsigned int head,
> +			bool indirect,
> +			struct vring_desc *indir_desc,
> +			void *data,
> +			void *ctx)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	unsigned int avail;
> +
> +	/* The desc chain is empty. */
> +	if (head == VIRTQUEUE_DESC_ID_INIT)
> +		return 0;
> +
> +	START_USE(vq);
> +
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
>  	vq->desc_state[head].data = data;
>  	if (indirect)
> -		vq->desc_state[head].indir_desc = desc;
> +		vq->desc_state[head].indir_desc = indir_desc;
>  	if (ctx)
>  		vq->desc_state[head].indir_desc = ctx;
>  
> @@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  		virtqueue_kick(_vq);
>  
>  	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_chain);
>  
> -unmap_release:
> -	err_idx = i;
> -	i = head;
> +/**
> + * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc
> + * @vq: the struct virtqueue we're talking about.
> + * @addr: address of the buffer to add.
> + * @len: length of the buffer.
> + * @head_id: desc id of the chain head.
> + * @prev_id: desc id of the previous buffer.
> + * @in: set if the buffer is for the device to write.
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_chain_desc(struct virtqueue *_vq,
> +			     uint64_t addr,
> +			     uint32_t len,
> +			     unsigned int *head_id,
> +			     unsigned int *prev_id,
> +			     bool in)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	struct vring_desc *desc = vq->vring.desc;
> +	uint16_t flags = in ? VRING_DESC_F_WRITE : 0;
> +	unsigned int i;
>  
> -	for (n = 0; n < total_sg; n++) {
> -		if (i == err_idx)
> -			break;
> -		vring_unmap_one(vq, &desc[i]);
> -		i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
> +	/* Sanity check */
> +	if (!_vq || !head_id || !prev_id)
> +		return -EINVAL;
> +retry:
> +	START_USE(vq);
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return -EIO;
>  	}
>  
> -	vq->vq.num_free += total_sg;
> +	if (vq->vq.num_free < 1) {
> +		/*
> +		 * If there is no desc available in the vq, kick what has
> +		 * already been added, and restart building a new chain
> +		 * for the passed sg.
> +		 */
> +		if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) {
> +			END_USE(vq);
> +			virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL);
> +			virtqueue_kick_sync(_vq);
> +			*head_id = VIRTQUEUE_DESC_ID_INIT;
> +			*prev_id = VIRTQUEUE_DESC_ID_INIT;
> +			goto retry;
> +		} else {
> +			END_USE(vq);
> +			return -ENOSPC;
> +		}
> +	}
>  
> -	if (indirect)
> -		kfree(desc);
> +	i = vq->free_head;
> +	flags &= ~VRING_DESC_F_NEXT;
> +	desc[i].flags = cpu_to_virtio16(_vq->vdev, flags);
> +	desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> +	desc[i].len = cpu_to_virtio32(_vq->vdev, len);
> +
> +	/* Add the desc to the end of the chain */
> +	if (*prev_id != VIRTQUEUE_DESC_ID_INIT) {
> +		desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i);
> +		desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev,
> +							 VRING_DESC_F_NEXT);
> +	}
> +	*prev_id = i;
> +	if (*head_id == VIRTQUEUE_DESC_ID_INIT)
> +		*head_id = *prev_id;
>  
> +	vq->vq.num_free--;
> +	vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next);
>  	END_USE(vq);
> -	return -EIO;
> +
> +	return 0;
>  }
> +EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc);
>  
>  /**
>   * virtqueue_add_sgs - expose buffers to other end
> @@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_kick);
>  
> +/**
> + * virtqueue_kick_sync - update after add_buf and busy wait till update is done
> + * @vq: the struct virtqueue
> + *
> + * After one or more virtqueue_add_* calls, invoke this to kick
> + * the other side. Busy wait till the other side is done with the update.
> + *
> + * Caller must ensure we don't call this with other virtqueue
> + * operations at the same time (except where noted).
> + *
> + * Returns false if kick failed, otherwise true.
> + */
> +bool virtqueue_kick_sync(struct virtqueue *vq)
> +{
> +	u32 len;
> +
> +	if (likely(virtqueue_kick(vq))) {
> +		while (!virtqueue_get_buf(vq, &len) &&
> +		       !virtqueue_is_broken(vq))
> +			cpu_relax();
> +		return true;
> +	}
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_kick_sync);
> +
> +/**
> + * virtqueue_kick_async - update after add_buf and block till update is done
> + * @vq: the struct virtqueue
> + *
> + * After one or more virtqueue_add_* calls, invoke this to kick
> + * the other side. Block until the other side is done with the update.
> + *
> + * Caller must ensure we don't call this with other virtqueue
> + * operations at the same time (except where noted).
> + *
> + * Returns false if kick failed, otherwise true.
> + */
> +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq)
> +{
> +	u32 len;
> +
> +	if (likely(virtqueue_kick(vq))) {
> +		wait_event(wq, virtqueue_get_buf(vq, &len));
> +		return true;
> +	}
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_kick_async);
> +

This happens to
1. drop the buf
2. not do the right thing if more than one is in flight

which means this API isn't all that useful. Even balloon
might benefit from keeping multiple bufs in flight down
the road.
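For illustration, one way to avoid dropping completions would be to wait
for a specific token instead of consuming whichever buffer finishes
first. A minimal sketch under that assumption (stash_completed_buf() is
a hypothetical helper, and the token bookkeeping it implies is elided):

/* Wait until our own token completes; completions belonging to other
 * in-flight requests are stashed for their waiters instead of dropped.
 */
static bool virtqueue_wait_for_token(struct virtqueue *vq,
				     wait_queue_head_t *wq, void *token)
{
	void *buf = NULL;
	unsigned int len;

	if (!virtqueue_kick(vq))
		return false;
	for (;;) {
		wait_event(*wq, (buf = virtqueue_get_buf(vq, &len)) ||
			   virtqueue_is_broken(vq));
		if (buf == token)
			return true;
		if (buf)
			stash_completed_buf(vq, buf); /* hypothetical */
		else if (virtqueue_is_broken(vq))
			return false;
	}
}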


>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head,
>  		       void **ctx)
>  {
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 28b0e96..9f27101 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq,
>  		      void *data,
>  		      gfp_t gfp);
>  
> +/* A desc with this init id is treated as an invalid desc */
> +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX
> +int virtqueue_add_chain_desc(struct virtqueue *_vq,
> +			     uint64_t addr,
> +			     uint32_t len,
> +			     unsigned int *head_id,
> +			     unsigned int *prev_id,
> +			     bool in);
> +
> +int virtqueue_add_chain(struct virtqueue *_vq,
> +			unsigned int head,
> +			bool indirect,
> +			struct vring_desc *indirect_desc,
> +			void *data,
> +			void *ctx);
> +
>  bool virtqueue_kick(struct virtqueue *vq);
>  
> +bool virtqueue_kick_sync(struct virtqueue *vq);
> +
> +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq);
> +
>  bool virtqueue_kick_prepare(struct virtqueue *vq);
>  
>  bool virtqueue_notify(struct virtqueue *vq);
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
  2017-07-12 13:06   ` Michael S. Tsirkin
  2017-07-13  0:44   ` Michael S. Tsirkin
@ 2017-07-13  1:16   ` kbuild test robot
  2017-07-13  4:21   ` kbuild test robot
  2017-07-28  8:25   ` Wei Wang
  4 siblings, 0 replies; 60+ messages in thread
From: kbuild test robot @ 2017-07-13  1:16 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 1509 bytes --]

Hi Wei,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.12 next-20170712]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956
config: i386-randconfig-x071-07121639 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers//virtio/virtio_balloon.c: In function 'tell_host_one_page':
>> drivers//virtio/virtio_balloon.c:535:39: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
                                          ^

vim +535 drivers//virtio/virtio_balloon.c

   527	
   528	static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq,
   529				       struct page *page)
   530	{
   531		unsigned int id = VIRTQUEUE_DESC_ID_INIT;
   532		u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT;
   533	
   534		virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0);
 > 535		virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL);
   536		virtqueue_kick_async(vq, vb->acked);
   537	}
   538	
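A plausible way to address the warning (an assumption, not from this
thread) is to make the conversion explicit by casting through uintptr_t.
Note this only quiets the compiler: on a 32-bit build a u64 address is
still truncated, so passing a token guaranteed to fit a pointer would
be the safer fix.

	/* Sketch: explicit two-step cast silences -Wint-to-pointer-cast;
	 * truncation of addr on 32-bit targets remains a real issue.
	 */
	virtqueue_add_chain(vq, id, 0, NULL, (void *)(uintptr_t)addr, NULL);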

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29833 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
                     ` (2 preceding siblings ...)
  2017-07-13  1:16   ` kbuild test robot
@ 2017-07-13  4:21   ` kbuild test robot
  2017-07-28  8:25   ` Wei Wang
  4 siblings, 0 replies; 60+ messages in thread
From: kbuild test robot @ 2017-07-13  4:21 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 1065 bytes --]

Hi Wei,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12 next-20170712]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

>> ERROR: ".xb_set_bit" [drivers/virtio/virtio_balloon.ko] undefined!
>> ERROR: ".xb_zero" [drivers/virtio/virtio_balloon.ko] undefined!
>> ERROR: ".xb_find_next_bit" [drivers/virtio/virtio_balloon.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23467 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 13:56       ` Michael S. Tsirkin
@ 2017-07-13  7:42         ` Wei Wang
  2017-07-13 20:19           ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-13  7:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
>
> So the way I see it, there are several issues:
>
> - internal wait - forces multiple APIs like kick/kick_sync
>    note how kick_sync can fail but your code never checks return code
> - need to re-write the last descriptor - might not work
>    for alternative layouts which always expose descriptors
>    immediately

Probably it wasn't clear. Please let me explain the two functions here:

1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
grabs a desc from the vq and inserts it at the chain tail (which is
indexed by prev_id; probably better to call it tail_id). The newly
added desc then becomes the tail (i.e. the last desc). The _F_NEXT flag
is cleared for each desc when it's added to the chain, and set when
another desc comes to follow later.

2) virtqueue_add_chain(vq, head_id,..): exposes the chain to the other end.

So, if people want to add a desc and immediately expose it to the other end,
i.e. build a single desc chain, they can just add and expose:

virtqueue_add_chain_desc(..);
virtqueue_add_chain(..,head_id);

Would you see any issues here?
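To make the flow concrete, a minimal usage sketch for a two-desc chain
(addr1/addr2, len1/len2 and token are placeholder values, not from the
patch):

	unsigned int head_id = VIRTQUEUE_DESC_ID_INIT;
	unsigned int prev_id = VIRTQUEUE_DESC_ID_INIT;

	/* Build the chain one desc at a time... */
	virtqueue_add_chain_desc(vq, addr1, len1, &head_id, &prev_id, 0);
	virtqueue_add_chain_desc(vq, addr2, len2, &head_id, &prev_id, 0);
	/* ...then expose the whole chain to the host and notify it. */
	virtqueue_add_chain(vq, head_id, 0, NULL, token, NULL);
	virtqueue_kick(vq);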


> - some kind of iterator type would be nicer instead of
>    maintaining head/prev explicitly

Why would we need to iterate the chain? I think it would be simpler to use
a wrapper struct:

struct virtqueue_desc_chain {
     unsigned int head;  // head desc id of the chain
     unsigned int tail;     // tail desc id of the chain
}

The new desc will be put at desc[tail].next, so we don't need to walk
from the head via desc[head].next when inserting a new desc into the
chain, right?
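As a sketch of that idea (helper name hypothetical, virtio16 endianness
conversions omitted for brevity), appending via the cached tail is O(1):

	/* Link a newly grabbed desc after the cached tail; no walk from
	 * desc[head] is needed.
	 */
	static void chain_append(struct vring_desc *desc,
				 struct virtqueue_desc_chain *chain,
				 unsigned int new_id)
	{
		desc[new_id].flags &= ~VRING_DESC_F_NEXT;
		desc[chain->tail].next = new_id;
		desc[chain->tail].flags |= VRING_DESC_F_NEXT;
		chain->tail = new_id;
	}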


>
> As for the use, it would be better to do
>
> if (!add_next(vq, ...)) {
> 	add_last(vq, ...)
> 	kick
> 	wait
> }

"!add_next(vq, ...)" means that the vq is full? If so, what would 
add_last() do then?


> Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but
> in fact it merely puts the branch in the virtio code.
>

Actually it wasn't intended to improve performance. It is used to
indicate the "init" state of the chain. So, when
virtqueue_add_chain_desc(vq, head_id,..) finds head_id == INIT, it will
assign the grabbed desc id to *head_id. In some sense, it is equivalent
to add_first().

Do you have a different opinion here?

Best,
Wei




^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-13  0:33   ` Michael S. Tsirkin
@ 2017-07-13  8:25     ` Wei Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-13  8:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/13/2017 08:33 AM, Michael S. Tsirkin wrote:
> On Wed, Jul 12, 2017 at 08:40:19PM +0800, Wei Wang wrote:
>> This patch adds support for reporting blocks of pages on the free list
>> specified by the caller.
>>
>> As pages can leave the free list during this call or immediately
>> afterwards, they are not guaranteed to be free after the function
>> returns. The only guarantee this makes is that the page was on the free
>> list at some point in time after the function has been invoked.
>>
>> Therefore, it is not safe for caller to use any pages on the returned
>> block or to discard data that is put there after the function returns.
>> However, it is safe for caller to discard data that was in one of these
>> pages before the function was invoked.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   include/linux/mm.h |  5 +++
>>   mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 101 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..76cb433 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
>> +extern int report_unused_page_block(struct zone *zone, unsigned int order,
>> +				    unsigned int migratetype,
>> +				    struct page **page);
>> +#endif
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 64b7d82..8b3c9dd 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
>> +
>> +/*
>> + * Heuristically get a page block in the system that is unused.
>> + * It is possible that pages from the page block are used immediately after
>> + * report_unused_page_block() returns. It is the caller's responsibility
>> + * to either detect or prevent the use of such pages.
>> + *
>> + * The free list to check: zone->free_area[order].free_list[migratetype].
>> + *
>> + * If the caller supplied page block (i.e. **page) is on the free list, offer
>> + * the next page block on the list to the caller. Otherwise, offer the first
>> + * page block on the list.
>> + *
>> + * Note: it is not safe for caller to use any pages on the returned
>> + * block or to discard data that is put there after the function returns.
>> + * However, it is safe for caller to discard data that was in one of these
>> + * pages before the function was invoked.
>> + *
>> + * Return 0 when a page block is found on the caller specified free list.
> Otherwise?

Other return values mean that no page block was found. I will document them.

>
>> + */
> As an alternative, we could have an API that scans free pages
> and invokes a callback under a lock. Granted, this might end up
> staying a lot of time under a lock. Is this a big issue?
> Some benchmarking will tell.
>
> It would then be up to the hypervisor to decide whether it wants to play
> tricks with the dirty bit or just wants to drop pages while VCPU is
> stopped.
>
>
>> +int report_unused_page_block(struct zone *zone, unsigned int order,
>> +			     unsigned int migratetype, struct page **page)
>> +{
>> +	struct zone *this_zone;
>> +	struct list_head *this_list;
>> +	int ret = 0;
>> +	unsigned long flags;
>> +
>> +	/* Sanity check */
>> +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
>> +	    migratetype >= MIGRATE_TYPES)
>> +		return -EINVAL;
> Why do callers this?
>
>> +
>> +	/* Zone validity check */
>> +	for_each_populated_zone(this_zone) {
>> +		if (zone == this_zone)
>> +			break;
>> +	}
> Why?  Will take a long time if there are lots of zones.
>
>> +
>> +	/* Got a non-existent zone from the caller? */
>> +	if (zone != this_zone)
>> +		return -EINVAL;
> When does this happen?

The above lines of code are just a sanity check. If they aren't
necessary, we can remove them.

>
>> +
>> +	spin_lock_irqsave(&this_zone->lock, flags);
>> +
>> +	this_list = &zone->free_area[order].free_list[migratetype];
>> +	if (list_empty(this_list)) {
>> +		*page = NULL;
>> +		ret = 1;
>
> What does this mean?

It just means that the list is empty, and the caller is expected to try
again with the next list.

Probably, returning -EAGAIN would be better?

>
>> +		*page = list_first_entry(this_list, struct page, lru);
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * The page block passed from the caller is not on this free list
>> +	 * anymore (e.g. a 1MB free page block has been split). In this case,
>> +	 * offer the first page block on the free list that the caller is
>> +	 * asking for.
> This just might keep giving you same block over and over again.
> E.g.
> 	- get 1st block
> 	- get 2nd block
> 	- 2nd gets broken up
> 	- get 1st block again
>
> this way we might never make progress beyond the 1st 2 blocks

Not really. I think the pages are allocated in order. If the 2nd block
isn't there, then the 1st block must be gone, too. So, the call will
return the 3rd one (which is the new first) on the list.

Best,
Wei


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [virtio-dev] Re: [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat
  2017-07-13  0:16   ` Michael S. Tsirkin
@ 2017-07-13  8:41     ` Wei Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-13  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/13/2017 08:16 AM, Michael S. Tsirkin wrote:
> On Wed, Jul 12, 2017 at 08:40:20PM +0800, Wei Wang wrote:
>> This patch enables for_each_zone()/for_each_populated_zone() to be
>> invoked by a kernel module.
> ... for use by virtio balloon.

With this patch, other kernel modules can also use for_each_zone().
Would it be better to state the justification more broadly?

>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> balloon seems to only use
> +       for_each_populated_zone(zone)
> +               for_each_migratetype_order(order, type)
>

Yes. Using for_each_populated_zone() requires the following exports.

Best,
Wei
>> ---
>>   mm/mmzone.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/mmzone.c b/mm/mmzone.c
>> index a51c0a6..08a2a3a 100644
>> --- a/mm/mmzone.c
>> +++ b/mm/mmzone.c
>> @@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void)
>>   {
>>   	return NODE_DATA(first_online_node);
>>   }
>> +EXPORT_SYMBOL_GPL(first_online_pgdat);
>>   
>>   struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
>>   {
>> @@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone)
>>   	}
>>   	return zone;
>>   }
>> +EXPORT_SYMBOL_GPL(next_zone);
>>   
>>   static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
>>   {
>> -- 
>> 2.7.4
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-07-13  0:22   ` Michael S. Tsirkin
@ 2017-07-13  8:46     ` Wei Wang
  2017-07-13 17:59       ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-13  8:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/13/2017 08:22 AM, Michael S. Tsirkin wrote:
> On Wed, Jul 12, 2017 at 08:40:21PM +0800, Wei Wang wrote:
>> Add a new vq, cmdq, to handle requests between the device and driver.
>>
>> This patch implements two commands sent from the device and handled in
>> the driver.
>> 1) VIRTIO_BALLOON_CMDQ_REPORT_STATS: this command is used to report
>> the guest memory statistics to the host. The stats_vq mechanism is not
>> used when the cmdq mechanism is enabled.
>> 2) VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: this command is used to
>> report the guest unused pages to the host.
>>
>> Since now we have a vq to handle multiple commands, we need to keep only
>> one vq operation at a time. Here, we change the existing START_USE()
>> and END_USE() to lock on each vq operation.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   drivers/virtio/virtio_balloon.c     | 245 ++++++++++++++++++++++++++++++++++--
>>   drivers/virtio/virtio_ring.c        |  25 +++-
>>   include/linux/virtio.h              |   2 +
>>   include/uapi/linux/virtio_balloon.h |  10 ++
>>   4 files changed, 265 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index aa4e7ec..ae91fbf 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>   
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *cmd_vq;
>>   
>>   	/* The balloon servicing is delegated to a freezable workqueue. */
>>   	struct work_struct update_balloon_stats_work;
>>   	struct work_struct update_balloon_size_work;
>> +	struct work_struct cmdq_handle_work;
>>   
>>   	/* Prevent updating balloon when it is being canceled. */
>>   	spinlock_t stop_update_lock;
>> @@ -90,6 +91,12 @@ struct virtio_balloon {
>>   	/* Memory statistics */
>>   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>   
>> +	/* Cmdq msg buffer for memory statistics */
>> +	struct virtio_balloon_cmdq_hdr cmdq_stats_hdr;
>> +
>> +	/* Cmdq msg buffer for reporting ununsed pages */
>> +	struct virtio_balloon_cmdq_hdr cmdq_unused_page_hdr;
>> +
>>   	/* To register callback in oom notifier call chain */
>>   	struct notifier_block nb;
>>   };
>> @@ -485,25 +492,214 @@ static void update_balloon_size_func(struct work_struct *work)
>>   		queue_work(system_freezable_wq, work);
>>   }
>>   
>> +static unsigned int cmdq_hdr_add(struct virtqueue *vq,
>> +				 struct virtio_balloon_cmdq_hdr *hdr,
>> +				 bool in)
>> +{
>> +	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
>> +	uint64_t hdr_pa = (uint64_t)virt_to_phys((void *)hdr);
>> +
>> +	virtqueue_add_chain_desc(vq, hdr_pa, sizeof(*hdr), &id, &id, in);
>> +
>> +	/* Deliver the hdr for the host to send commands. */
>> +	if (in) {
>> +		hdr->flags = 0;
>> +		virtqueue_add_chain(vq, id, 0, NULL, hdr, NULL);
>> +		virtqueue_kick(vq);
>> +	}
>> +
>> +	return id;
>> +}
>> +
>> +static void cmdq_add_chain_desc(struct virtio_balloon *vb,
>> +				struct virtio_balloon_cmdq_hdr *hdr,
>> +				uint64_t addr,
>> +				uint32_t len,
>> +				unsigned int *head_id,
>> +				unsigned int *prev_id)
>> +{
>> +retry:
>> +	if (*head_id == VIRTQUEUE_DESC_ID_INIT) {
>> +		*head_id = cmdq_hdr_add(vb->cmd_vq, hdr, 0);
>> +		*prev_id = *head_id;
>> +	}
>> +
>> +	virtqueue_add_chain_desc(vb->cmd_vq, addr, len, head_id, prev_id, 0);
>> +	if (*head_id == *prev_id) {
> That's an ugly way to detect ring full.

It's actually not detecting a full ring. I will call it tail_id instead
of prev_id. So, *head_id == *tail_id is the case where the first desc
has just been added by virtqueue_add_chain_desc().

Best,
Wei


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-07-13  8:46     ` Wei Wang
@ 2017-07-13 17:59       ` Michael S. Tsirkin
  0 siblings, 0 replies; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-13 17:59 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Thu, Jul 13, 2017 at 04:46:29PM +0800, Wei Wang wrote:
> On 07/13/2017 08:22 AM, Michael S. Tsirkin wrote:
> > On Wed, Jul 12, 2017 at 08:40:21PM +0800, Wei Wang wrote:
> > > Add a new vq, cmdq, to handle requests between the device and driver.
> > > 
> > > This patch implements two commands sent from the device and handled in
> > > the driver.
> > > 1) VIRTIO_BALLOON_CMDQ_REPORT_STATS: this command is used to report
> > > the guest memory statistics to the host. The stats_vq mechanism is not
> > > used when the cmdq mechanism is enabled.
> > > 2) VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: this command is used to
> > > report the guest unused pages to the host.
> > > 
> > > Since now we have a vq to handle multiple commands, we need to keep only
> > > one vq operation at a time. Here, we change the existing START_USE()
> > > and END_USE() to lock on each vq operation.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c     | 245 ++++++++++++++++++++++++++++++++++--
> > >   drivers/virtio/virtio_ring.c        |  25 +++-
> > >   include/linux/virtio.h              |   2 +
> > >   include/uapi/linux/virtio_balloon.h |  10 ++
> > >   4 files changed, 265 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index aa4e7ec..ae91fbf 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
> > >   struct virtio_balloon {
> > >   	struct virtio_device *vdev;
> > > -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *cmd_vq;
> > >   	/* The balloon servicing is delegated to a freezable workqueue. */
> > >   	struct work_struct update_balloon_stats_work;
> > >   	struct work_struct update_balloon_size_work;
> > > +	struct work_struct cmdq_handle_work;
> > >   	/* Prevent updating balloon when it is being canceled. */
> > >   	spinlock_t stop_update_lock;
> > > @@ -90,6 +91,12 @@ struct virtio_balloon {
> > >   	/* Memory statistics */
> > >   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > +	/* Cmdq msg buffer for memory statistics */
> > > +	struct virtio_balloon_cmdq_hdr cmdq_stats_hdr;
> > > +
> > > +	/* Cmdq msg buffer for reporting ununsed pages */

typo above btw

> > > +	struct virtio_balloon_cmdq_hdr cmdq_unused_page_hdr;
> > > +
> > >   	/* To register callback in oom notifier call chain */
> > >   	struct notifier_block nb;
> > >   };
> > > @@ -485,25 +492,214 @@ static void update_balloon_size_func(struct work_struct *work)
> > >   		queue_work(system_freezable_wq, work);
> > >   }
> > > +static unsigned int cmdq_hdr_add(struct virtqueue *vq,
> > > +				 struct virtio_balloon_cmdq_hdr *hdr,
> > > +				 bool in)
> > > +{
> > > +	unsigned int id = VIRTQUEUE_DESC_ID_INIT;
> > > +	uint64_t hdr_pa = (uint64_t)virt_to_phys((void *)hdr);
> > > +
> > > +	virtqueue_add_chain_desc(vq, hdr_pa, sizeof(*hdr), &id, &id, in);
> > > +
> > > +	/* Deliver the hdr for the host to send commands. */
> > > +	if (in) {
> > > +		hdr->flags = 0;
> > > +		virtqueue_add_chain(vq, id, 0, NULL, hdr, NULL);
> > > +		virtqueue_kick(vq);
> > > +	}
> > > +
> > > +	return id;
> > > +}
> > > +
> > > +static void cmdq_add_chain_desc(struct virtio_balloon *vb,
> > > +				struct virtio_balloon_cmdq_hdr *hdr,
> > > +				uint64_t addr,
> > > +				uint32_t len,
> > > +				unsigned int *head_id,
> > > +				unsigned int *prev_id)
> > > +{
> > > +retry:
> > > +	if (*head_id == VIRTQUEUE_DESC_ID_INIT) {
> > > +		*head_id = cmdq_hdr_add(vb->cmd_vq, hdr, 0);
> > > +		*prev_id = *head_id;
> > > +	}
> > > +
> > > +	virtqueue_add_chain_desc(vb->cmd_vq, addr, len, head_id, prev_id, 0);
> > > +	if (*head_id == *prev_id) {
> > That's an ugly way to detect ring full.
> 
> It's actually not detecting ring full. I will call it tail_id, instead of
> prev_id.
> So, *head_id == *tail_id is the case that the first desc was just added by
>  virtqueue_add_chain_desc().
> 
> Best,
> Wei

Oh, so it's adding a header before each list. Ugh.

I don't think we should stay with this API. It's just too tricky to use.

If we have an API that fails when it can't add descriptors
(you can reserve space for the last descriptor)
the balloon knows whether it's the first descriptor in a chain
and can just use a boolean that tells it whether that is the case.
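As a sketch of that suggestion (function names hypothetical): add_next()
returns false once only the reserved final slot remains, at which point
the caller closes the chain with add_last(), kicks, and waits:

	bool first = true;

	while (have_more_blocks()) {		/* hypothetical */
		u64 addr = next_block_addr();	/* hypothetical */

		if (first) {
			add_first(vq, addr, len);
			first = false;
		} else if (!add_next(vq, addr, len)) {
			/* Only the reserved slot is left: use it to
			 * terminate the chain, then flush.
			 */
			add_last(vq, addr, len);
			virtqueue_kick(vq);
			wait_for_host_ack(vq);	/* hypothetical */
			first = true;
		}
	}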


-- 
MST


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-13  7:42         ` Wei Wang
@ 2017-07-13 20:19           ` Michael S. Tsirkin
  2017-07-14  7:12             ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-13 20:19 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
> > 
> > So the way I see it, there are several issues:
> > 
> > - internal wait - forces multiple APIs like kick/kick_sync
> >    note how kick_sync can fail but your code never checks return code
> > - need to re-write the last descriptor - might not work
> >    for alternative layouts which always expose descriptors
> >    immediately
> 
> Probably it wasn't clear. Please let me explain the two functions here:
> 
> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
> grabs a desc from the vq and inserts it to the chain tail (which is indexed
> by
> prev_id, probably better to call it tail_id). Then, the new added desc
> becomes
> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
> when it's
> added to the chain, and set when another desc comes to follow later.

And this only works if there are multiple rings like
avail + descriptor ring.
It won't work e.g. with the proposed new layout where
writing out a descriptor exposes it immediately.

> 2) virtqueue_add_chain(vq, head_id,..): expose the chain to the other end.
> 
> So, if people want to add a desc and immediately expose it to the other end,
> i.e. build a single desc chain, they can just add and expose:
> 
> virtqueue_add_chain_desc(..);
> virtqueue_add_chain(..,head_id);
> 
> Would you see any issues here?

The way the new APIs poll the used ring internally.

> 
> > - some kind of iterator type would be nicer instead of
> >    maintaining head/prev explicitly
> 
> Why would we need to iterate the chain?

In your patches prev/tail are iterators - they keep track of
where you are in the chain.

> I think it would be simpler to use
> a wrapper struct:
> 
> struct virtqueue_desc_chain {
>     unsigned int head;  // head desc id of the chain
>     unsigned int tail;     // tail desc id of the chain
> }
> 
> The new desc will be put to desc[tail].next, and we don't need to walk
> from the head desc[head].next when inserting a new desc to the chain, right?
> 
> 
> > 
> > As for the use, it would be better to do
> > 
> > if (!add_next(vq, ...)) {
> > 	add_last(vq, ...)
> > 	kick
> > 	wait
> > }
> 
> "!add_next(vq, ...)" means that the vq is full?


No - it means there's only 1 entry left for the last descriptor.


> If so, what would add_last()
> do then?
> 
> > Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but
> > in fact it merely puts the branch in the virtio code.
> > 
> 
> Actually it wasn't intended to improve performance. It is used to indicate
> the "init" state
> of the chain. So, when virtqueue_add_chain_desc(, head_id,..) finds head
> id=INIT, it will
> assign the grabbed desc id to &head_id. In some sense, it is equivalent to
> add_first().
> 
> Do you have a different opinion here?
> 
> Best,
> Wei
> 

It is, but let's make it explicit here - an API function is better
than a special value.

-- 
MST


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-13 20:19           ` Michael S. Tsirkin
@ 2017-07-14  7:12             ` Wei Wang
  2017-07-23  1:45               ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-14  7:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:
> On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
>> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
>>> So the way I see it, there are several issues:
>>>
>>> - internal wait - forces multiple APIs like kick/kick_sync
>>>     note how kick_sync can fail but your code never checks return code
>>> - need to re-write the last descriptor - might not work
>>>     for alternative layouts which always expose descriptors
>>>     immediately
>> Probably it wasn't clear. Please let me explain the two functions here:
>>
>> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
>> grabs a desc from the vq and inserts it to the chain tail (which is indexed
>> by
>> prev_id, probably better to call it tail_id). Then, the new added desc
>> becomes
>> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
>> when it's
>> added to the chain, and set when another desc comes to follow later.
> And this only works if there are multiple rings like
> avail + descriptor ring.
> It won't work e.g. with the proposed new layout where
> writing out a descriptor exposes it immediately.

I think it can support the 1.1 proposal, too. But before getting into
that, I think we first need to take a deep dive into the implementation
and usage of _first/next/last. The usage would need to lock the vq from
the first call to the last (otherwise, the returned info about the
number of available descs in the vq, i.e. num_free, would be invalid):

lock(vq);
add_first();
add_next();
add_last();
unlock(vq);

However, I think the case isn't this simple, since we need to check
more things after each add_xx() step. For example, if only one entry is
available at the time we start to use the vq, that is, num_free is 0
after add_first(), we wouldn't be able to add_next() and add_last(). So,
it would work like this:

start:
     ...get free page block..
     lock(vq)
retry:
     ret = add_first(..,&num_free,);
     if (ret == -ENOSPC) {
         goto retry;
     } else if (!num_free) {
         add_chain_head();
         unlock(vq);
         kick & wait;
         goto start;
     }
next_one:
     ...get free page block..
     add_next(..,&num_free,);
     if (!num_free) {
         add_chain_head();
         unlock(vq);
         kick & wait;
         goto start;
     } else if (num_free == 1) {
         ...get free page block..
         add_last(..);
         unlock(vq);
         kick & wait;
         goto start;
     } else {
         goto next_one;
     }

Given the above, having three different APIs seems unnecessary to me.
That's the reason for combining them into one virtqueue_add_chain_desc().

-- or, do you have a different thought about using the three APIs?


Implementation Reference:

struct desc_iterator {
     unsigned int head;
     unsigned int tail;
};

add_first(*vq, *desc_iterator, *num_free, ..)
{
     if (vq->vq.num_free < 1)
         return -ENOSPC;
     get_desc(&desc_id);
     desc[desc_id].flag &= ~_F_NEXT;
     desc_iterator->head = desc_id;
     desc_iterator->tail = desc_iterator->head;
     *num_free = vq->vq.num_free;
}

add_next(vq, desc_iterator, *num_free,..)
{
     get_desc(&desc_id);
     desc[desc_id].flag &= ~_F_NEXT;
     desc[desc_iterator->tail].next = desc_id;
     desc[desc_iterator->tail].flag |= _F_NEXT;
     desc_iterator->tail = desc_id;
     *num_free = vq->vq.num_free;
}

add_last(vq, desc_iterator,..)
{
     get_desc(&desc_id);
     desc[desc_id].flag &= ~_F_NEXT;
     desc[desc_iterator->tail].next = desc_id;
     desc[desc_iterator->tail].flag |= _F_NEXT;
     desc_iterator->tail = desc_id;

     add_chain_head(); // put desc_iterator->head onto the ring
}


Best,
Wei



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-12 12:40 ` [PATCH v12 6/8] mm: support reporting free page blocks Wei Wang
  2017-07-13  0:33   ` Michael S. Tsirkin
@ 2017-07-14 12:30   ` Michal Hocko
  2017-07-14 12:54     ` Michal Hocko
  2017-07-14 19:17     ` Michael S. Tsirkin
  1 sibling, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2017-07-14 12:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed 12-07-17 20:40:19, Wei Wang wrote:
> This patch adds support for reporting blocks of pages on the free list
> specified by the caller.
> 
> As pages can leave the free list during this call or immediately
> afterwards, they are not guaranteed to be free after the function
> returns. The only guarantee this makes is that the page was on the free
> list at some point in time after the function has been invoked.
> 
> Therefore, it is not safe for caller to use any pages on the returned
> block or to discard data that is put there after the function returns.
> However, it is safe for caller to discard data that was in one of these
> pages before the function was invoked.

I do not understand what is the point of such a function and how it is
used because the patch doesn't give us any user (I haven't checked other
patches yet).

But just from the semantic point of view this sounds like a horrible
idea. The only way to get a free block of pages is to call the page
allocator. I am tempted to give it Nack right on those grounds but I
would like to hear more about what you actually want to achieve.

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  include/linux/mm.h |  5 +++
>  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..76cb433 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> +				    unsigned int migratetype,
> +				    struct page **page);
> +#endif
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64b7d82..8b3c9dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +
> +/*
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * report_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Note: it is not safe for caller to use any pages on the returned
> + * block or to discard data that is put there after the function returns.
> + * However, it is safe for caller to discard data that was in one of these
> + * pages before the function was invoked.
> + *
> + * Return 0 when a page block is found on the caller specified free list.
> + */
> +int report_unused_page_block(struct zone *zone, unsigned int order,
> +			     unsigned int migratetype, struct page **page)
> +{
> +	struct zone *this_zone;
> +	struct list_head *this_list;
> +	int ret = 0;
> +	unsigned long flags;
> +
> +	/* Sanity check */
> +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> +	    migratetype >= MIGRATE_TYPES)
> +		return -EINVAL;
> +
> +	/* Zone validity check */
> +	for_each_populated_zone(this_zone) {
> +		if (zone == this_zone)
> +			break;
> +	}
> +
> +	/* Got a non-existent zone from the caller? */
> +	if (zone != this_zone)
> +		return -EINVAL;

Huh, what do you check for here? Why don't you simply
populated_zone(zone)?

> +
> +	spin_lock_irqsave(&this_zone->lock, flags);
> +
> +	this_list = &zone->free_area[order].free_list[migratetype];
> +	if (list_empty(this_list)) {
> +		*page = NULL;
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* The caller is asking for the first free page block on the list */
> +	if ((*page) == NULL) {
> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller is not on this free list
> +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> +	 * offer the first page block on the free list that the caller is
> +	 * asking for.
> +	 */
> +	if (PageBuddy(*page) && order != page_order(*page)) {
> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller has been the last page block
> +	 * on the list.
> +	 */
> +	if ((*page)->lru.next == this_list) {
> +		*page = NULL;
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Finally, fall into the regular case: the page block passed from the
> +	 * caller is still on the free list. Offer the next one.
> +	 */
> +	*page = list_next_entry((*page), lru);
> +	ret = 0;
> +out:
> +	spin_unlock_irqrestore(&this_zone->lock, flags);
> +	return ret;
> +}
> +EXPORT_SYMBOL(report_unused_page_block);
> +
> +#endif
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat
  2017-07-12 12:40 ` [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat Wei Wang
  2017-07-13  0:16   ` Michael S. Tsirkin
@ 2017-07-14 12:31   ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-07-14 12:31 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed 12-07-17 20:40:20, Wei Wang wrote:
> This patch enables for_each_zone()/for_each_populated_zone() to be
> invoked by a kernel module.

This needs much better justification with an example of who is going to
use these symbols and what for.
 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>  mm/mmzone.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index a51c0a6..08a2a3a 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void)
>  {
>  	return NODE_DATA(first_online_node);
>  }
> +EXPORT_SYMBOL_GPL(first_online_pgdat);
>  
>  struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
>  {
> @@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone)
>  	}
>  	return zone;
>  }
> +EXPORT_SYMBOL_GPL(next_zone);
>  
>  static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
>  {
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-14 12:30   ` Michal Hocko
@ 2017-07-14 12:54     ` Michal Hocko
  2017-07-14 15:46       ` Michael S. Tsirkin
  2017-07-14 19:17     ` Michael S. Tsirkin
  1 sibling, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-14 12:54 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Fri 14-07-17 14:30:23, Michal Hocko wrote:
> On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > This patch adds support for reporting blocks of pages on the free list
> > specified by the caller.
> > 
> > As pages can leave the free list during this call or immediately
> > afterwards, they are not guaranteed to be free after the function
> > returns. The only guarantee this makes is that the page was on the free
> > list at some point in time after the function has been invoked.
> > 
> > Therefore, it is not safe for caller to use any pages on the returned
> > block or to discard data that is put there after the function returns.
> > However, it is safe for caller to discard data that was in one of these
> > pages before the function was invoked.
> 
> I do not understand what is the point of such a function and how it is
> used because the patch doesn't give us any user (I haven't checked other
> patches yet).
> 
> But just from the semantic point of view this sounds like a horrible
> idea. The only way to get a free block of pages is to call the page
> allocator. I am tempted to give it Nack right on those grounds but I
> would like to hear more about what you actually want to achieve.

OK, so I gave it another thought and giving a page which is still on the
free list to a random module is just a free ticket to a disaster.
Nacked-by: Michal Hocko <mhocko@suse.com>

> 
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > ---
> >  include/linux/mm.h |  5 +++
> >  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 101 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 46b9ac5..76cb433 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> >  		unsigned long zone_start_pfn, unsigned long *zholes_size);
> >  extern void free_initmem(void);
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> > +				    unsigned int migratetype,
> > +				    struct page **page);
> > +#endif
> >  /*
> >   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> >   * into the buddy system. The freed pages will be poisoned with pattern
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 64b7d82..8b3c9dd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  	show_swap_cache_info();
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +
> > +/*
> > + * Heuristically get a page block in the system that is unused.
> > + * It is possible that pages from the page block are used immediately after
> > + * report_unused_page_block() returns. It is the caller's responsibility
> > + * to either detect or prevent the use of such pages.
> > + *
> > + * The free list to check: zone->free_area[order].free_list[migratetype].
> > + *
> > + * If the caller supplied page block (i.e. **page) is on the free list, offer
> > + * the next page block on the list to the caller. Otherwise, offer the first
> > + * page block on the list.
> > + *
> > + * Note: it is not safe for caller to use any pages on the returned
> > + * block or to discard data that is put there after the function returns.
> > + * However, it is safe for caller to discard data that was in one of these
> > + * pages before the function was invoked.
> > + *
> > + * Return 0 when a page block is found on the caller specified free list.
> > + */
> > +int report_unused_page_block(struct zone *zone, unsigned int order,
> > +			     unsigned int migratetype, struct page **page)
> > +{
> > +	struct zone *this_zone;
> > +	struct list_head *this_list;
> > +	int ret = 0;
> > +	unsigned long flags;
> > +
> > +	/* Sanity check */
> > +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> > +	    migratetype >= MIGRATE_TYPES)
> > +		return -EINVAL;
> > +
> > +	/* Zone validity check */
> > +	for_each_populated_zone(this_zone) {
> > +		if (zone == this_zone)
> > +			break;
> > +	}
> > +
> > +	/* Got a non-existent zone from the caller? */
> > +	if (zone != this_zone)
> > +		return -EINVAL;
> 
> Huh, what do you check for here? Why don't you simply
> populated_zone(zone)?
> 
> > +
> > +	spin_lock_irqsave(&this_zone->lock, flags);
> > +
> > +	this_list = &zone->free_area[order].free_list[migratetype];
> > +	if (list_empty(this_list)) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/* The caller is asking for the first free page block on the list */
> > +	if ((*page) == NULL) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller is not on this free list
> > +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> > +	 * offer the first page block on the free list that the caller is
> > +	 * asking for.
> > +	 */
> > +	if (PageBuddy(*page) && order != page_order(*page)) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller has been the last page block
> > +	 * on the list.
> > +	 */
> > +	if ((*page)->lru.next == this_list) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * Finally, fall into the regular case: the page block passed from the
> > +	 * caller is still on the free list. Offer the next one.
> > +	 */
> > +	*page = list_next_entry((*page), lru);
> > +	ret = 0;
> > +out:
> > +	spin_unlock_irqrestore(&this_zone->lock, flags);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(report_unused_page_block);
> > +
> > +#endif
> > +
> >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> >  {
> >  	zoneref->zone = zone;
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-14 12:54     ` Michal Hocko
@ 2017-07-14 15:46       ` Michael S. Tsirkin
  0 siblings, 0 replies; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-14 15:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Fri, Jul 14, 2017 at 02:54:30PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 14:30:23, Michal Hocko wrote:
> > On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > > This patch adds support for reporting blocks of pages on the free list
> > > specified by the caller.
> > > 
> > > As pages can leave the free list during this call or immediately
> > > afterwards, they are not guaranteed to be free after the function
> > > returns. The only guarantee this makes is that the page was on the free
> > > list at some point in time after the function has been invoked.
> > > 
> > > Therefore, it is not safe for caller to use any pages on the returned
> > > block or to discard data that is put there after the function returns.
> > > However, it is safe for caller to discard data that was in one of these
> > > pages before the function was invoked.
> > 
> > I do not understand what is the point of such a function and how it is
> > used because the patch doesn't give us any user (I haven't checked other
> > patches yet).
> > 
> > But just from the semantic point of view this sounds like a horrible
> > idea. The only way to get a free block of pages is to call the page
> > allocator. I am tempted to give it Nack right on those grounds but I
> > would like to hear more about what you actually want to achieve.
> 
> OK, so I gave it another thought and giving a page which is still on the
> free list to a random module is just a free ticket to a disaster.
> Nacked-by: Michal Hocko <mhocko@suse.com>

I agree it should be EXPORT_SYMBOL_GPL. Too much power
to give to non-GPL modules.

But pls take a look at the explanation I posted.  Any kind of hypervisor
hinting will need to do this by definition - best we can do is keep a
lock while we do this.

> > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >  include/linux/mm.h |  5 +++
> > >  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 101 insertions(+)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 46b9ac5..76cb433 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> > >  		unsigned long zone_start_pfn, unsigned long *zholes_size);
> > >  extern void free_initmem(void);
> > >  
> > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > > +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> > > +				    unsigned int migratetype,
> > > +				    struct page **page);
> > > +#endif
> > >  /*
> > >   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> > >   * into the buddy system. The freed pages will be poisoned with pattern
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 64b7d82..8b3c9dd 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > >  	show_swap_cache_info();
> > >  }
> > >  
> > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > > +
> > > +/*
> > > + * Heuristically get a page block in the system that is unused.
> > > + * It is possible that pages from the page block are used immediately after
> > > + * report_unused_page_block() returns. It is the caller's responsibility
> > > + * to either detect or prevent the use of such pages.
> > > + *
> > > + * The free list to check: zone->free_area[order].free_list[migratetype].
> > > + *
> > > + * If the caller supplied page block (i.e. **page) is on the free list, offer
> > > + * the next page block on the list to the caller. Otherwise, offer the first
> > > + * page block on the list.
> > > + *
> > > + * Note: it is not safe for caller to use any pages on the returned
> > > + * block or to discard data that is put there after the function returns.
> > > + * However, it is safe for caller to discard data that was in one of these
> > > + * pages before the function was invoked.
> > > + *
> > > + * Return 0 when a page block is found on the caller specified free list.
> > > + */
> > > +int report_unused_page_block(struct zone *zone, unsigned int order,
> > > +			     unsigned int migratetype, struct page **page)
> > > +{
> > > +	struct zone *this_zone;
> > > +	struct list_head *this_list;
> > > +	int ret = 0;
> > > +	unsigned long flags;
> > > +
> > > +	/* Sanity check */
> > > +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> > > +	    migratetype >= MIGRATE_TYPES)
> > > +		return -EINVAL;
> > > +
> > > +	/* Zone validity check */
> > > +	for_each_populated_zone(this_zone) {
> > > +		if (zone == this_zone)
> > > +			break;
> > > +	}
> > > +
> > > +	/* Got a non-existent zone from the caller? */
> > > +	if (zone != this_zone)
> > > +		return -EINVAL;
> > 
> > Huh, what do you check for here? Why don't you simply use
> > populated_zone(zone)?
> > 
> > > +
> > > +	spin_lock_irqsave(&this_zone->lock, flags);
> > > +
> > > +	this_list = &zone->free_area[order].free_list[migratetype];
> > > +	if (list_empty(this_list)) {
> > > +		*page = NULL;
> > > +		ret = 1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/* The caller is asking for the first free page block on the list */
> > > +	if ((*page) == NULL) {
> > > +		*page = list_first_entry(this_list, struct page, lru);
> > > +		ret = 0;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * The page block passed from the caller is not on this free list
> > > +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> > > +	 * offer the first page block on the free list that the caller is
> > > +	 * asking for.
> > > +	 */
> > > +	if (PageBuddy(*page) && order != page_order(*page)) {
> > > +		*page = list_first_entry(this_list, struct page, lru);
> > > +		ret = 0;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * The page block passed from the caller has been the last page block
> > > +	 * on the list.
> > > +	 */
> > > +	if ((*page)->lru.next == this_list) {
> > > +		*page = NULL;
> > > +		ret = 1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Finally, fall into the regular case: the page block passed from the
> > > +	 * caller is still on the free list. Offer the next one.
> > > +	 */
> > > +	*page = list_next_entry((*page), lru);
> > > +	ret = 0;
> > > +out:
> > > +	spin_unlock_irqrestore(&this_zone->lock, flags);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(report_unused_page_block);
> > > +
> > > +#endif
> > > +
> > >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> > >  {
> > >  	zoneref->zone = zone;
> > > -- 
> > > 2.7.4
> > > 
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-14 12:30   ` Michal Hocko
  2017-07-14 12:54     ` Michal Hocko
@ 2017-07-14 19:17     ` Michael S. Tsirkin
  2017-07-17 15:24       ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-14 19:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote:
> On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > This patch adds support for reporting blocks of pages on the free list
> > specified by the caller.
> > 
> > As pages can leave the free list during this call or immediately
> > afterwards, they are not guaranteed to be free after the function
> > returns. The only guarantee this makes is that the page was on the free
> > list at some point in time after the function has been invoked.
> > 
> > Therefore, it is not safe for caller to use any pages on the returned
> > block or to discard data that is put there after the function returns.
> > However, it is safe for caller to discard data that was in one of these
> > pages before the function was invoked.
> 
> I do not understand what is the point of such a function and how it is
> used because the patch doesn't give us any user (I haven't checked other
> patches yet).
> 
> But just from the semantic point of view this sounds like a horrible
> idea. The only way to get a free block of pages is to call the page
> allocator. I am tempted to give it Nack right on those grounds but I
> would like to hear more about what you actually want to achieve.

Basically it's a performance hint to the hypervisor.
For example, these pages would be good candidates to
move around as they are not mapped into any running
applications.

As such, it's important not to slow down other parts of the system too
much - otherwise we are speeding up one part of the system while we slow
down other parts of it, which is why it's trying to drop the lock as
soon as possible.

As long as the hypervisor does not assume it can drop these pages, and
as long as it's correct in most cases, we are OK even if the hint is
slightly wrong, because hypervisor notifications are racing with
allocations.

There are patches to do more tricks - if hypervisor tracks all
memory writes we might actually use this hint to discard data -
but that is just implementation detail.


> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > ---
> >  include/linux/mm.h |  5 +++
> >  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 101 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 46b9ac5..76cb433 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> >  		unsigned long zone_start_pfn, unsigned long *zholes_size);
> >  extern void free_initmem(void);
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> > +				    unsigned int migratetype,
> > +				    struct page **page);
> > +#endif
> >  /*
> >   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> >   * into the buddy system. The freed pages will be poisoned with pattern
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 64b7d82..8b3c9dd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  	show_swap_cache_info();
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +
> > +/*
> > + * Heuristically get a page block in the system that is unused.
> > + * It is possible that pages from the page block are used immediately after
> > + * report_unused_page_block() returns. It is the caller's responsibility
> > + * to either detect or prevent the use of such pages.
> > + *
> > + * The free list to check: zone->free_area[order].free_list[migratetype].
> > + *
> > + * If the caller supplied page block (i.e. **page) is on the free list, offer
> > + * the next page block on the list to the caller. Otherwise, offer the first
> > + * page block on the list.
> > + *
> > + * Note: it is not safe for caller to use any pages on the returned
> > + * block or to discard data that is put there after the function returns.
> > + * However, it is safe for caller to discard data that was in one of these
> > + * pages before the function was invoked.
> > + *
> > + * Return 0 when a page block is found on the caller specified free list.
> > + */
> > +int report_unused_page_block(struct zone *zone, unsigned int order,
> > +			     unsigned int migratetype, struct page **page)
> > +{
> > +	struct zone *this_zone;
> > +	struct list_head *this_list;
> > +	int ret = 0;
> > +	unsigned long flags;
> > +
> > +	/* Sanity check */
> > +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> > +	    migratetype >= MIGRATE_TYPES)
> > +		return -EINVAL;
> > +
> > +	/* Zone validity check */
> > +	for_each_populated_zone(this_zone) {
> > +		if (zone == this_zone)
> > +			break;
> > +	}
> > +
> > +	/* Got a non-existent zone from the caller? */
> > +	if (zone != this_zone)
> > +		return -EINVAL;
> 
> Huh, what do you check for here? Why don't you simply use
> populated_zone(zone)?
> 
> > +
> > +	spin_lock_irqsave(&this_zone->lock, flags);
> > +
> > +	this_list = &zone->free_area[order].free_list[migratetype];
> > +	if (list_empty(this_list)) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/* The caller is asking for the first free page block on the list */
> > +	if ((*page) == NULL) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller is not on this free list
> > +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> > +	 * offer the first page block on the free list that the caller is
> > +	 * asking for.
> > +	 */
> > +	if (PageBuddy(*page) && order != page_order(*page)) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller has been the last page block
> > +	 * on the list.
> > +	 */
> > +	if ((*page)->lru.next == this_list) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * Finally, fall into the regular case: the page block passed from the
> > +	 * caller is still on the free list. Offer the next one.
> > +	 */
> > +	*page = list_next_entry((*page), lru);
> > +	ret = 0;
> > +out:
> > +	spin_unlock_irqrestore(&this_zone->lock, flags);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(report_unused_page_block);
> > +
> > +#endif
> > +
> >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> >  {
> >  	zoneref->zone = zone;
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-14 19:17     ` Michael S. Tsirkin
@ 2017-07-17 15:24       ` Michal Hocko
  2017-07-18  2:12         ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-17 15:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Fri 14-07-17 22:17:13, Michael S. Tsirkin wrote:
> On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote:
> > On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > > This patch adds support for reporting blocks of pages on the free list
> > > specified by the caller.
> > > 
> > > As pages can leave the free list during this call or immediately
> > > afterwards, they are not guaranteed to be free after the function
> > > returns. The only guarantee this makes is that the page was on the free
> > > list at some point in time after the function has been invoked.
> > > 
> > > Therefore, it is not safe for caller to use any pages on the returned
> > > block or to discard data that is put there after the function returns.
> > > However, it is safe for caller to discard data that was in one of these
> > > pages before the function was invoked.
> > 
> > I do not understand what is the point of such a function and how it is
> > used because the patch doesn't give us any user (I haven't checked other
> > patches yet).
> > 
> > But just from the semantic point of view this sounds like a horrible
> > idea. The only way to get a free block of pages is to call the page
> > allocator. I am tempted to give it Nack right on those grounds but I
> > would like to hear more about what you actually want to achieve.
> 
> Basically it's a performance hint to the hypervisor.
> For example, these pages would be good candidates to
> move around as they are not mapped into any running
> applications.
>
> As such, it's important not to slow down other parts of the system too
> much - otherwise we are speeding up one part of the system while we slow
> down other parts of it, which is why it's trying to drop the lock as
> soon as possible.

So why cannot you simply allocate those pages and then do whatever you
need. You can tell the page allocator to do only a lightweight
allocation by the gfp_mask - e.g. GFP_NOWAIT or if you even do not want
to risk kswapd intervening then 0 mask.
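
For illustration, a minimal sketch of that approach (hypothetical code;
report_block() stands in for whatever hinting mechanism the driver
would use, and order is the block order being scanned):

	struct page *page = alloc_pages(GFP_NOWAIT | __GFP_NOWARN, order);

	if (page) {
		/* the block cannot be reused while the driver holds it */
		report_block(page_to_pfn(page), 1UL << order);
		__free_pages(page, order);
	}

Blocks obtained this way are guaranteed to stay free while the hint is
being sent, at the cost of a round trip through the allocator per block.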

> As long as the hypervisor does not assume it can drop these pages, and
> as long as it's correct in most cases, we are OK even if the hint is
> slightly wrong, because hypervisor notifications are racing with
> allocations.

But the page could have been reused anytime after the lock is dropped
and you cannot check for that except for elevating the reference count.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-17 15:24       ` Michal Hocko
@ 2017-07-18  2:12         ` Wei Wang
  2017-07-19  8:13           ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-18  2:12 UTC (permalink / raw)
  To: Michal Hocko, Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/17/2017 11:24 PM, Michal Hocko wrote:
> On Fri 14-07-17 22:17:13, Michael S. Tsirkin wrote:
>> On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote:
>>> On Wed 12-07-17 20:40:19, Wei Wang wrote:
>>>> This patch adds support for reporting blocks of pages on the free list
>>>> specified by the caller.
>>>>
>>>> As pages can leave the free list during this call or immediately
>>>> afterwards, they are not guaranteed to be free after the function
>>>> returns. The only guarantee this makes is that the page was on the free
>>>> list at some point in time after the function has been invoked.
>>>>
>>>> Therefore, it is not safe for caller to use any pages on the returned
>>>> block or to discard data that is put there after the function returns.
>>>> However, it is safe for caller to discard data that was in one of these
>>>> pages before the function was invoked.
>>> I do not understand what is the point of such a function and how it is
>>> used because the patch doesn't give us any user (I haven't checked other
>>> patches yet).
>>>
>>> But just from the semantic point of view this sounds like a horrible
>>> idea. The only way to get a free block of pages is to call the page
>>> allocator. I am tempted to give it Nack right on those grounds but I
>>> would like to hear more about what you actually want to achieve.
>> Basically it's a performance hint to the hypervisor.
>> For example, these pages would be good candidates to
>> move around as they are not mapped into any running
>> applications.
>>
>> As such, it's important not to slow down other parts of the system too
>> much - otherwise we are speeding up one part of the system while we slow
>> down other parts of it, which is why it's trying to drop the lock as
> > soon as possible.


Probably I should have included the introduction of the usage in
the log. Hope it is not too late to explain here:

Live migration needs to transfer the VM's memory from the source
machine to the destination round by round. For the 1st round, all the VM's
memory is transferred. From the 2nd round, only the pieces of memory
that were written by the guest (after the 1st round) are transferred. One
method that is commonly used by the hypervisor to track which part of
memory is written is to write-protect all the guest memory.

This patch enables the optimization of the 1st round memory transfer -
the hypervisor can skip the transfer of guest unused pages in the 1st round.
It does not matter if the memory pages are used after they are given to
the hypervisor as a hint of the unused pages, because they will be
tracked by the hypervisor and transferred in the next round if they are
used and written.
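
As a rough sketch of how the driver side is expected to consume the API
(illustrative only, not the exact patch 8 code; send_unused_block() is a
hypothetical helper that puts the hint on the virtqueue):

	struct page *page = NULL;
	int ret;

	do {
		/* ret == 0 means *page points to a reported free block */
		ret = report_unused_page_block(zone, order, migratetype, &page);
		if (!ret)
			send_unused_block(page_to_pfn(page), 1UL << order);
	} while (!ret);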


> So why cannot you simply allocate those pages and then do whatever you
> need. You can tell the page allocator to do only a lightweight
> allocation by the gfp_mask - e.g. GFP_NOWAIT or if you even do not want
> to risk kswapd intervening then 0 mask.


Here are the 2 reasons that we can't get the hint of unused pages by
allocating them:

1) It's expected that live migration shouldn't affect the things running
inside the VM - taking away all the free pages from the guest would
greatly slow down the activities inside the guest (e.g. the network
transmission may be stuck due to the lack of sk_buffs).

2) The hint of free pages is used to optimize the 1st round memory
transfer, so the hint is expected to reach the hypervisor as quickly as
possible. Depending on the memory size of the guest, allocating all the
free memory would take too long for that case.

Hope it clarifies the use case.


>> As long as the hypervisor does not assume it can drop these pages, and
>> as long as it's correct in most cases, we are OK even if the hint is
>> slightly wrong, because hypervisor notifications are racing with
>> allocations.
> But the page could have been reused anytime after the lock is dropped
> and you cannot check for that except for elevating the reference count.

As also explained above, the hypervisor uses a dirty page logging
mechanism to track which memory pages are written by the guest once
live migration begins.


Best,
Wei



* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-18  2:12         ` Wei Wang
@ 2017-07-19  8:13           ` Michal Hocko
  2017-07-19 12:01             ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-19  8:13 UTC (permalink / raw)
  To: Wei Wang
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Tue 18-07-17 10:12:14, Wei Wang wrote:
[...]
> Probably I should have included the introduction of the usage in
> the log. Hope it is not too late to explain here:

Yes this should have been described in the cover.
 
> Live migration needs to transfer the VM's memory from the source
> machine to the destination round by round. For the 1st round, all the VM's
> memory is transferred. From the 2nd round, only the pieces of memory
> that were written by the guest (after the 1st round) are transferred. One
> method that is commonly used by the hypervisor to track which part of
> memory is written is to write-protect all the guest memory.
> 
> This patch enables the optimization of the 1st round memory transfer -
> the hypervisor can skip the transfer of guest unused pages in the 1st round.

All you should need is the check for the page reference count, no?  I
assume you do some sort of pfn walk and so you should be able to get an
access to the struct page.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-19  8:13           ` Michal Hocko
@ 2017-07-19 12:01             ` Wei Wang
  2017-07-24  9:00               ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-19 12:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On 07/19/2017 04:13 PM, Michal Hocko wrote:
> On Tue 18-07-17 10:12:14, Wei Wang wrote:
> [...]
>> Probably I should have included the introduction of the usage in
>> the log. Hope it is not too late to explain here:
> Yes this should have been described in the cover.


OK, I will do it in the next version.


>   
>> Live migration needs to transfer the VM's memory from the source
>> machine to the destination round by round. For the 1st round, all the VM's
>> memory is transferred. From the 2nd round, only the pieces of memory
>> that were written by the guest (after the 1st round) are transferred. One
>> method that is commonly used by the hypervisor to track which part of
>> memory is written is to write-protect all the guest memory.
>>
>> This patch enables the optimization of the 1st round memory transfer -
>> the hypervisor can skip the transfer of guest unused pages in the 1st round.
> All you should need is the check for the page reference count, no?  I
> assume you do some sort of pfn walk and so you should be able to get an
> access to the struct page.


Not necessarily - the guest struct page is not seen by the hypervisor. The
hypervisor only gets those guest pfns which are hinted as unused. From the
hypervisor (host) point of view, a guest physical address corresponds to a
virtual address of a host process. So, once the hypervisor knows a guest
physical page is unused, it knows that the corresponding virtual memory of
the process doesn't need to be transferred in the 1st round.
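
To illustrate the host-side view (a simplified sketch with made-up
names, assuming a single RAM region; real hypervisors keep a table of
regions):

	/* translate a hinted guest physical address to the host virtual
	 * address of the VM process that backs it */
	void *gpa_to_hva(uint64_t gpa)
	{
		return (uint8_t *)guest_ram_hva + (gpa - guest_ram_base);
	}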


Best,
Wei


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-14  7:12             ` Wei Wang
@ 2017-07-23  1:45               ` Michael S. Tsirkin
  2017-07-26  3:48                 ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-23  1:45 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:
> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:
> > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
> > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
> > > > So the way I see it, there are several issues:
> > > > 
> > > > - internal wait - forces multiple APIs like kick/kick_sync
> > > >     note how kick_sync can fail but your code never checks return code
> > > > - need to re-write the last descriptor - might not work
> > > >     for alternative layouts which always expose descriptors
> > > >     immediately
> > > Probably it wasn't clear. Please let me explain the two functions here:
> > > 
> > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
> > > grabs a desc from the vq and inserts it to the chain tail (which is
> > > indexed by prev_id, probably better to call it tail_id). Then, the
> > > newly added desc becomes the tail (i.e. the last desc). The _F_NEXT
> > > flag is cleared for each desc when it's added to the chain, and set
> > > when another desc comes to follow later.
> > And this only works if there are multiple rings like
> > avail + descriptor ring.
> > It won't work e.g. with the proposed new layout where
> > writing out a descriptor exposes it immediately.
> 
> I think it can support the 1.1 proposal, too. But before getting
> into that, I think we first need to deep dive into the implementation
> and usage of _first/next/last. The usage would need to lock the vq
> from the first to the end (otherwise, the returned info about the number
> of available desc in the vq, i.e. num_free, would be invalid):
> 
> lock(vq);
> add_first();
> add_next();
> add_last();
> unlock(vq);
> 
> However, I think the case isn't this simple, since we need to check
> more things after each add_xx() step. For example, if only one entry is
> available at the time we start to use the vq, that is, num_free is 0
> after add_first(), we wouldn't be able to add_next and add_last. So, it
> would work like this:
> 
> start:
>     ...get free page block..
>     lock(vq)
> retry:
>     ret = add_first(..,&num_free,);
>     if(ret == -ENOSPC) {
>         goto retry;
>     } else if (!num_free) {
>         add_chain_head();
>         unlock(vq);
>         kick & wait;
>         goto start;
>     }
> next_one:
>     ...get free page block..
>     add_next(..,&num_free,);
>     if (!num_free) {
>         add_chain_head();
>         unlock(vq);
>         kick & wait;
>         goto start;
>     } else if (num_free == 1) {
>         ...get free page block..
>         add_last(..);
>         unlock(vq);
>         kick & wait;
>         goto start;
>     } else {
>         goto next_one;
>     }
> 
> Given the above, it seems unnecessary to me to have three different APIs.
> That's the reason to combine them into one virtqueue_add_chain_desc().
> 
> -- or, do you have a different thought about using the three APIs?
> 
> 
> Implementation Reference:
> 
> struct desc_iterator {
>     unsigned int head;
>     unsigned int tail;
> };
> 
> add_first(*vq, *desc_iterator, *num_free, ..)
> {
>     if (vq->vq.num_free < 1)
>         return -ENOSPC;
>     get_desc(&desc_id);
>     desc[desc_id].flag &= ~_F_NEXT;
>     desc_iterator->head = desc_id;
>     desc_iterator->tail = desc_iterator->head;
>     *num_free = vq->vq.num_free;
> }
> 
> add_next(vq, desc_iterator, *num_free,..)
> {
>     get_desc(&desc_id);
>     desc[desc_id].flag &= ~_F_NEXT;
>     desc[desc_iterator->tail].next = desc_id;
>     desc[desc_iterator->tail].flag |= _F_NEXT;
>     desc_iterator->tail = desc_id;
>     *num_free = vq->vq.num_free;
> }
> 
> add_last(vq, desc_iterator,..)
> {
>     get_desc(&desc_id);
>     desc[desc_id].flag &= ~_F_NEXT;
>     desc[desc_iterator->tail].next = desc_id;
>     desc_iterator->tail = desc_id;
> 
>     add_chain_head(); // put the desc_iterator.head to the ring
> }
> 
> 
> Best,
> Wei

OK I thought this over. While we might need these new APIs in
the future, I think that at the moment, there's a way to implement
this feature that is significantly simpler. Just add each s/g
as a separate input buffer.

This needs zero new APIs.

I know that follow-up patches need to add a header in front
so you might be thinking: how am I going to add this
header? The answer is quite simple - add it as a separate
out header.

Host will be able to distinguish between header and pages
by looking at the direction, and - should we want to add
IN data to header - additionally size (<4K => header).

We will be able to look at extended APIs separately down
the road.
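
For illustration, a minimal sketch of that scheme (assumed code, not the
actual patch; the page itself serves as the opaque cookie here):

	static int report_one_block(struct virtqueue *vq, struct page *page,
				    unsigned int order)
	{
		struct scatterlist sg;

		sg_init_table(&sg, 1);
		sg_set_page(&sg, page, PAGE_SIZE << order, 0);

		/* one buffer per block - no new chaining APIs needed */
		return virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
	}

The header would be queued the same way as its own buffer via
virtqueue_add_outbuf(), so the host can tell it apart from the pages by
its direction.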

-- 
MST


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-19 12:01             ` Wei Wang
@ 2017-07-24  9:00               ` Michal Hocko
  2017-07-25  9:32                 ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-24  9:00 UTC (permalink / raw)
  To: Wei Wang
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Wed 19-07-17 20:01:18, Wei Wang wrote:
> On 07/19/2017 04:13 PM, Michal Hocko wrote:
[...
> >All you should need is the check for the page reference count, no?  I
> >assume you do some sort of pfn walk and so you should be able to get an
> >access to the struct page.
> 
> Not necessarily - the guest struct page is not seen by the hypervisor. The
> hypervisor only gets those guest pfns which are hinted as unused. From the
> hypervisor (host) point of view, a guest physical address corresponds to a
> virtual address of a host process. So, once the hypervisor knows a guest
> physical page is unused, it knows that the corresponding virtual memory of
> the process doesn't need to be transferred in the 1st round.

I am sorry, but I do not understand. Why cannot _guest_ simply check the
struct page ref count and send them to the hypervisor? Is there any
documentation which describes the workflow or code which would use your
new API?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-24  9:00               ` Michal Hocko
@ 2017-07-25  9:32                 ` Wei Wang
  2017-07-25 11:25                   ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-25  9:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On 07/24/2017 05:00 PM, Michal Hocko wrote:
> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
> [...
>>> All you should need is the check for the page reference count, no?  I
>>> assume you do some sort of pfn walk and so you should be able to get an
>>> access to the struct page.
>> Not necessarily - the guest struct page is not seen by the hypervisor. The
>> hypervisor only gets those guest pfns which are hinted as unused. From the
>> hypervisor (host) point of view, a guest physical address corresponds to a
>> virtual address of a host process. So, once the hypervisor knows a guest
>> physical page is unused, it knows that the corresponding virtual memory of
>> the process doesn't need to be transferred in the 1st round.
> I am sorry, but I do not understand. Why cannot _guest_ simply check the
> struct page ref count and send them to the hypervisor?

Were you suggesting the following?
1) get a free page block from the page list using the API;
2) if page->ref_count == 0, send it to the hypervisor

Btw, ref_count may also change at any time.

> Is there any
> documentation which describes the workflow or code which would use your
> new API?
>

It's used in the balloon driver (patch 8). We don't have any docs yet, but
I think the high level workflow is the two steps above.


Best,
Wei


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-25  9:32                 ` Wei Wang
@ 2017-07-25 11:25                   ` Michal Hocko
  2017-07-25 11:56                     ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-25 11:25 UTC (permalink / raw)
  To: Wei Wang
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Tue 25-07-17 17:32:00, Wei Wang wrote:
> On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >[...
> >>>All you should need is the check for the page reference count, no?  I
> >>>assume you do some sort of pfn walk and so you should be able to get an
> >>>access to the struct page.
> >>Not necessarily - the guest struct page is not seen by the hypervisor. The
> >>hypervisor only gets those guest pfns which are hinted as unused. From the
> >>hypervisor (host) point of view, a guest physical address corresponds to a
> >>virtual address of a host process. So, once the hypervisor knows a guest
> >>physical page is unused, it knows that the corresponding virtual memory of
> >>the process doesn't need to be transferred in the 1st round.
> >I am sorry, but I do not understand. Why cannot _guest_ simply check the
> >struct page ref count and send them to the hypervisor?
> 
> Were you suggesting the following?
> 1) get a free page block from the page list using the API;

No. Use a pfn walk, check the reference count and skip those pages which
have 0 ref count. I suspected that you need to do some sort of the pfn
walk anyway because you somehow have to evaluate a memory to migrate,
right?
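
For illustration, such a walk might look like this (a sketch;
hint_unused_page() is a hypothetical reporting helper, and
start_pfn/end_pfn bound the guest's memory):

	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		/* free pages have a zero reference count */
		if (page_count(page) == 0)
			hint_unused_page(pfn);
	}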

> 2) if page->ref_count == 0, send it to the hypervisor

yes

> Btw, ref_count may also change at any time.
> 
> >Is there any
> >documentation which describes the workflow or code which would use your
> >new API?
> >
> 
> It's used in the balloon driver (patch 8). We don't have any docs yet, but
> I think the high level workflow is the two steps above.

I will have a look.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-25 11:25                   ` Michal Hocko
@ 2017-07-25 11:56                     ` Wei Wang
  2017-07-25 12:41                       ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-25 11:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On 07/25/2017 07:25 PM, Michal Hocko wrote:
> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>> [...
>>>>> All you should need is the check for the page reference count, no?  I
>>>>> assume you do some sort of pfn walk and so you should be able to get an
>>>>> access to the struct page.
>>>> Not necessarily - the guest struct page is not seen by the hypervisor. The
>>>> hypervisor only gets those guest pfns which are hinted as unused. From the
>>>> hypervisor (host) point of view, a guest physical address corresponds to a
>>>> virtual address of a host process. So, once the hypervisor knows a guest
>>>> physical page is unused, it knows that the corresponding virtual memory of
>>>> the process doesn't need to be transferred in the 1st round.
>>> I am sorry, but I do not understand. Why cannot _guest_ simply check the
>>> struct page ref count and send them to the hypervisor?
>> Were you suggesting the following?
>> 1) get a free page block from the page list using the API;
> No. Use a pfn walk, check the reference count and skip those pages which
> have 0 ref count.


"pfn walk" - do you mean start from the first pfn, and scan all the pfns 
that the VM has?


> I suspected that you need to do some sort of the pfn
> walk anyway because you somehow have to evaluate a memory to migrate,
> right?


We don't need to do the pfn walk in the guest kernel. When the API
reports, for example, a 2MB free page block, the API caller offers the
base address of the page block, and size=2MB, to the hypervisor.

The hypervisor maintains a bitmap of all the guest physical memory (a
bit corresponds to a guest pfn). When migrating memory, only the pfns
that are set in the bitmap are transferred to the destination machine.
So, when the hypervisor receives a 2MB free page block, the
corresponding bits in the bitmap are cleared.
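
For illustration, the host-side bookkeeping could be as simple as this
(a sketch with made-up names, assuming 4KB guest pages):

	/* clear the migration bitmap bits covering a hinted free block */
	static void skip_free_block(unsigned long *migration_bitmap,
				    uint64_t gpa, uint64_t size)
	{
		bitmap_clear(migration_bitmap, gpa >> 12, size >> 12);
	}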

Best,
Wei


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-25 11:56                     ` Wei Wang
@ 2017-07-25 12:41                       ` Michal Hocko
  2017-07-25 14:47                         ` Wang, Wei W
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-25 12:41 UTC (permalink / raw)
  To: Wei Wang
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Tue 25-07-17 19:56:24, Wei Wang wrote:
> On 07/25/2017 07:25 PM, Michal Hocko wrote:
> >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >>>[...
> >>>>>All you should need is the check for the page reference count, no?  I
> >>>>>assume you do some sort of pfn walk and so you should be able to get an
> >>>>>access to the struct page.
> >>>>Not necessarily - the guest struct page is not seen by the hypervisor. The
> >>>>hypervisor only gets those guest pfns which are hinted as unused. From the
> >>>>hypervisor (host) point of view, a guest physical address corresponds to a
> >>>>virtual address of a host process. So, once the hypervisor knows a guest
> >>>>physical page is unused, it knows that the corresponding virtual memory of
> >>>>the process doesn't need to be transferred in the 1st round.
> >>>I am sorry, but I do not understand. Why cannot _guest_ simply check the
> >>>struct page ref count and send them to the hypervisor?
> >>Were you suggesting the following?
> >>1) get a free page block from the page list using the API;
> >No. Use a pfn walk, check the reference count and skip those pages which
> >have 0 ref count.
> 
> 
> "pfn walk" - do you mean start from the first pfn, and scan all the pfns
> that the VM has?

yes

> >I suspected that you need to do some sort of the pfn
> >walk anyway because you somehow have to evaluate a memory to migrate,
> >right?
> 
> We don't need to do the pfn walk in the guest kernel. When the API
> reports, for example, a 2MB free page block, the API caller offers the
> base address of the page block, and size=2MB, to the hypervisor.

So you want to skip pfn walks by regularly calling into the page
allocator to update your bitmap. If that is the case then would an API
that would allow you to update your bitmap via a callback be
sufficient? Something like
	void walk_free_mem(int node, int min_order,
			void (*visit)(unsigned long pfn, unsigned long nr_pages))

The function will call the given callback for each free memory block on
the given node starting from the given min_order. The callback will be
strictly an atomic and very light context. You can update your bitmap
from there.
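
For illustration, such a walk could be implemented roughly as follows (a
sketch only, mirroring the proposed prototype):

	void walk_free_mem(int node, int min_order,
			void (*visit)(unsigned long pfn, unsigned long nr_pages))
	{
		struct zone *zone;
		unsigned int order, mt;
		unsigned long flags;
		struct page *page;

		for_each_populated_zone(zone) {
			if (zone_to_nid(zone) != node)
				continue;
			spin_lock_irqsave(&zone->lock, flags);
			for (order = min_order; order < MAX_ORDER; order++) {
				for (mt = 0; mt < MIGRATE_TYPES; mt++) {
					/* runs under zone->lock: the callback
					 * must stay atomic and cheap */
					list_for_each_entry(page,
						&zone->free_area[order].free_list[mt],
						lru)
						visit(page_to_pfn(page),
						      1UL << order);
				}
			}
			spin_unlock_irqrestore(&zone->lock, flags);
		}
	}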

This would address my main concern that the allocator internals would
get outside of the allocator proper. A nasty callback which would be too
expensive could still stall other allocations and cause latencies but
the locking will be under mm core control at least.

Does that sound useful?
-- 
Michal Hocko
SUSE Labs


* RE: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-25 12:41                       ` Michal Hocko
@ 2017-07-25 14:47                         ` Wang, Wei W
  2017-07-25 14:53                           ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Wang, Wei W @ 2017-07-25 14:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
> On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > >>>[...
> > We don't need to do the pfn walk in the guest kernel. When the API
> > reports, for example, a 2MB free page block, the API caller offers the
> > base address of the page block, and size=2MB, to the hypervisor.
> 
> So you want to skip pfn walks by regularly calling into the page allocator to
> update your bitmap. If that is the case then would an API that would allow you
> to update your bitmap via a callback be sufficient? Something like
> 	void walk_free_mem(int node, int min_order,
> 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
> 
> The function will call the given callback for each free memory block on the given
> node starting from the given min_order. The callback will be strictly an atomic
> and very light context. You can update your bitmap from there.

I need to introduce more background here:
The hypervisor and the guest live in their own address spaces. The
hypervisor's bitmap isn't seen by the guest. I think we also wouldn't be
able to give a callback function from the hypervisor to the guest in
this case.

> 
> This would address my main concern that the allocator internals would get
> outside of the allocator proper. 

What issue would exposing the internal for_each_zone() cause?
I think new code which calls it will also be strictly reviewed when it
is pushed upstream.

Best,
Wei




* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-25 14:47                         ` Wang, Wei W
@ 2017-07-25 14:53                           ` Michal Hocko
  2017-07-26  2:22                             ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-25 14:53 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
> On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
> > On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > > >>>[...
> > > We don't need to do the pfn walk in the guest kernel. When the API
> > > reports, for example, a 2MB free page block, the API caller offers the
> > > base address of the page block, and size=2MB, to the hypervisor.
> > 
> > So you want to skip pfn walks by regularly calling into the page allocator to
> > update your bitmap. If that is the case then would an API that would allow you
> > to update your bitmap via a callback be sufficient? Something like
> > 	void walk_free_mem(int node, int min_order,
> > 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
> > 
> > The function will call the given callback for each free memory block on the given
> > node starting from the given min_order. The callback will be strictly an atomic
> > and very light context. You can update your bitmap from there.
> 
> I need to introduce more background here:
> The hypervisor and the guest live in their own address spaces. The
> hypervisor's bitmap isn't seen by the guest. I think we also wouldn't be
> able to give a callback function from the hypervisor to the guest in
> this case.

How did you plan to use your original API which exports the struct page array
then?

> > This would address my main concern that the allocator internals would get
> > outside of the allocator proper. 
> 
> What issue would exposing the internal for_each_zone() cause?

zone is an MM internal concept. No code outside of the MM proper should
really care about zones. 
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-25 14:53                           ` Michal Hocko
@ 2017-07-26  2:22                             ` Wei Wang
  2017-07-26 10:24                               ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-26  2:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On 07/25/2017 10:53 PM, Michal Hocko wrote:
> On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
>> On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
>>> On Tue 25-07-17 19:56:24, Wei Wang wrote:
>>>> On 07/25/2017 07:25 PM, Michal Hocko wrote:
>>>>> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>>>>>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>>>>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>>>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>>>>>> [...
>>>> We don't need to do the pfn walk in the guest kernel. When the API
>>>> reports, for example, a 2MB free page block, the API caller offers the
>>>> base address of the page block, and size=2MB, to the hypervisor.
>>> So you want to skip pfn walks by regularly calling into the page allocator to
>>> update your bitmap. If that is the case then would an API that would allow you
>>> to update your bitmap via a callback be sufficient? Something like
>>> 	void walk_free_mem(int node, int min_order,
>>> 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
>>>
>>> The function will call the given callback for each free memory block on the given
>>> node starting from the given min_order. The callback will be strictly an atomic
>>> and very light context. You can update your bitmap from there.
>> I need to introduce more background here:
>> The hypervisor and the guest live in their own address spaces. The
>> hypervisor's bitmap isn't seen by the guest. I think we also wouldn't be
>> able to give a callback function from the hypervisor to the guest in
>> this case.
> How did you plan to use your original API which exports the struct page array
> then?


That's where the virtio-balloon driver comes in. It uses a shared ring
mechanism to send the guest memory info to the hypervisor.

We didn't expose the struct page array from the guest to the hypervisor.
For example, when a 2MB free page block is reported from the free page
list, the info put on the ring is just (base address of the 2MB
contiguous memory, size=2MB).


>
>>> This would address my main concern that the allocator internals would get
>>> outside of the allocator proper.
>> What issue would exposing the internal for_each_zone() cause?
> zone is an MM internal concept. No code outside of the MM proper should
> really care about zones.

I think this is also what Andrew suggested in the previous discussion:
https://lkml.org/lkml/2017/3/16/951

Move the code to virtio-balloon; a little layering violation seems
acceptable.


Best,
Wei


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-23  1:45               ` Michael S. Tsirkin
@ 2017-07-26  3:48                 ` Wei Wang
  2017-07-26 17:02                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-26  3:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote:
> On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:
>> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:
>>> On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
>>>> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
>>>>> So the way I see it, there are several issues:
>>>>>
>>>>> - internal wait - forces multiple APIs like kick/kick_sync
>>>>>      note how kick_sync can fail but your code never checks return code
>>>>> - need to re-write the last descriptor - might not work
>>>>>      for alternative layouts which always expose descriptors
>>>>>      immediately
>>>> Probably it wasn't clear. Please let me explain the two functions here:
>>>>
>>>> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
>>>> grabs a desc from the vq and inserts it to the chain tail (which is
>>>> indexed by prev_id, probably better to call it tail_id). Then, the
>>>> newly added desc becomes the tail (i.e. the last desc). The _F_NEXT
>>>> flag is cleared for each desc when it's added to the chain, and set
>>>> when another desc comes to follow later.
>>> And this only works if there are multiple rings like
>>> avail + descriptor ring.
>>> It won't work e.g. with the proposed new layout where
>>> writing out a descriptor exposes it immediately.
>> I think it can support the 1.1 proposal, too. But before getting
>> into that, I think we first need to deep dive into the implementation
>> and usage of _first/next/last. The usage would need to lock the vq
>> from the first to the end (otherwise, the returned info about the number
>> of available desc in the vq, i.e. num_free, would be invalid):
>>
>> lock(vq);
>> add_first();
>> add_next();
>> add_last();
>> unlock(vq);
>>
>> However, I think the case isn't this simple, since we need to check
>> more things after each add_xx() step. For example, if only one entry is
>> available at the time we start to use the vq, that is, num_free is 0
>> after add_first(), we wouldn't be able to add_next and add_last. So, it
>> would work like this:
>>
>> start:
>>      ...get free page block..
>>      lock(vq)
>> retry:
>>      ret = add_first(..,&num_free,);
>>      if(ret == -ENOSPC) {
>>          goto retry;
>>      } else if (!num_free) {
>>          add_chain_head();
>>          unlock(vq);
>>          kick & wait;
>>          goto start;
>>      }
>> next_one:
>>      ...get free page block..
>>      add_next(..,&num_free,);
>>      if (!num_free) {
>>          add_chain_head();
>>          unlock(vq);
>>          kick & wait;
>>          goto start;
>>      } else if (num_free == 1) {
>>          ...get free page block..
>>          add_last(..);
>>          unlock(vq);
>>          kick & wait;
>>          goto start;
>>      } else {
>>          goto next_one;
>>      }
>>
>> Given the above, it seems unnecessary to me to have three different APIs.
>> That's the reason to combine them into one virtqueue_add_chain_desc().
>>
>> -- or, do you have a different thought about using the three APIs?
>>
>>
>> Implementation Reference:
>>
>> struct desc_iterator {
>>      unsigned int head;
>>      unsigned int tail;
>> };
>>
>> add_first(*vq, *desc_iterator, *num_free, ..)
>> {
>>      if (vq->vq.num_free < 1)
>>          return -ENOSPC;
>>      get_desc(&desc_id);
>>      desc[desc_id].flag &= ~_F_NEXT;
>>      desc_iterator->head = desc_id;
>>      desc_iterator->tail = desc_iterator->head;
>>      *num_free = vq->vq.num_free;
>> }
>>
>> add_next(vq, desc_iterator, *num_free,..)
>> {
>>      get_desc(&desc_id);
>>      desc[desc_id].flag &= ~_F_NEXT;
>>      desc[desc_iterator->tail].next = desc_id;
>>      desc[desc_iterator->tail].flag |= _F_NEXT;
>>      desc_iterator->tail = desc_id;
>>      *num_free = vq->vq.num_free;
>> }
>>
>> add_last(vq, desc_iterator,..)
>> {
>>      get_desc(&desc_id);
>>      desc[desc_id].flag &= ~_F_NEXT;
>>      desc[desc_iterator->tail].next = desc_id;
>>      desc_iterator->tail = desc_id;
>>
>>      add_chain_head(); // put the desc_iterator.head to the ring
>> }
>>
>>
>> Best,
>> Wei
> OK I thought this over. While we might need these new APIs in
> the future, I think that at the moment, there's a way to implement
> this feature that is significantly simpler. Just add each s/g
> as a separate input buffer.


Should it be an output buffer? I think output means from the
driver to device (i.e. DMA_TO_DEVICE).

>
> This needs zero new APIs.
>
> I know that follow-up patches need to add a header in front
> so you might be thinking: how am I going to add this
> header? The answer is quite simple - add it as a separate
> out header.
>
> Host will be able to distinguish between header and pages
> by looking at the direction, and - should we want to add
> IN data to header - additionally size (<4K => header).


I think this works fine when the cmdq is only used for
reporting the unused pages. It would be an issue
if there are other usages (e.g. report memory statistics)
interleaving. I think one solution would be to lock the cmdq until
a cmd usage is done (e.g. all the unused pages are reported) -
in this case, the periodically updated guest memory statistics
may be delayed for a while occasionally when live migration starts.
Would this be acceptable? If not, probably we can have the cmdq
for one usage only.
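
(For reference, a minimal sketch of the "each s/g as a separate buffer"
approach discussed above. The vq variable and the error handling are
assumptions, and whether the buffer should be IN or OUT is exactly the
open question in this subthread:)

struct scatterlist sg;
struct page *page = balloon_page;  /* one page of the chunk to report (name assumed) */
int err;

sg_init_one(&sg, page_address(page), PAGE_SIZE);
/* one buffer per s/g entry; no new virtqueue APIs are needed */
err = virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
if (!err)
	virtqueue_kick(vq);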


Best,
Wei


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-26  2:22                             ` Wei Wang
@ 2017-07-26 10:24                               ` Michal Hocko
  2017-07-26 11:44                                 ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-26 10:24 UTC (permalink / raw)
  To: Wei Wang
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Wed 26-07-17 10:22:23, Wei Wang wrote:
> On 07/25/2017 10:53 PM, Michal Hocko wrote:
> >On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
> >>On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
> >>>On Tue 25-07-17 19:56:24, Wei Wang wrote:
> >>>>On 07/25/2017 07:25 PM, Michal Hocko wrote:
> >>>>>On Tue 25-07-17 17:32:00, Wei Wang wrote:
> >>>>>>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >>>>>>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>>>>>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >>>>>>>[...]
> >>>>We don't need to do the pfn walk in the guest kernel. When the API
> >>>>reports, for example, a 2MB free page block, the API caller offers to
> >>>>the hypervisor the base address of the page block, and size=2MB, to
> >>>>the hypervisor.
> >>>So you want to skip pfn walks by regularly calling into the page allocator to
> >>>update your bitmap. If that is the case then would an API that would allow you
> >>>to update your bitmap via a callback be sufficient? Something like
> >>>	void walk_free_mem(int node, int min_order,
> >>>			void (*visit)(unsigned long pfn, unsigned long nr_pages))
> >>>
> >>>The function will call the given callback for each free memory block on the given
> >>>node starting from the given min_order. The callback will be strictly an atomic
> >>>and very light context. You can update your bitmap from there.
> >>I would need to introduce more about the background here:
> >>The hypervisor and the guest live in their own address space. The hypervisor's bitmap
> >>isn't seen by the guest. I think we also wouldn't be able to give a callback function
> >>from the hypervisor to the guest in this case.
>How did you plan to use your original API which exports a struct page array
>then?
> 
> 
> That's where the virtio-balloon driver comes in. It uses a shared ring
> mechanism to send the guest memory info to the hypervisor.
> 
> We didn't expose the struct page array from the guest to the hypervisor.
> For example, when a 2MB free page block is reported from the free page
> list, the info put on the ring is just (base address of the 2MB
> contiguous memory, size=2M).

So what exactly prevents virtio-balloon from using the above proposed
callback mechanism and exporting what is needed to the hypervisor?
 
> >>>This would address my main concern that the allocator internals would get
> >>>outside of the allocator proper.
> >>What issue would there be with exposing the internal for_each_zone()?
> >zone is a MM internal concept. No code outside of the MM proper should
> >really care about zones.
> 
> I think this is also what Andrew suggested in the previous discussion:
> https://lkml.org/lkml/2017/3/16/951
> 
> Move the code to virtio-balloon and a little layering violation seems
> acceptable.

Andrew didn't suggest exposing zones to modules. If I read his words
properly he said that a functionality which "provides a snapshot of the
present system unused pages" is just too specific for the virtio usecase
to be a generic function and as such it should be in virtio. I tend
to agree. Even the proposed callback API is a layer violation. We
should just make sure that the API is not inherently dangerous. That
is why I have chosen to give only pfn and nr_pages to the caller. Sure
somebody could argue that the caller could do pfn_to_page and do nasty
things. That would be a fair argument but nothing really prevents
anybody from doing the pfn walk already so there shouldn't inherently be
more risk. We can document what we expect from the API user and that
would be much easier to describe than the struct page API IMHO.
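
(As a sketch, the expectations on the API user mentioned above could be
captured in a kerneldoc comment on the walker; the wording here is
illustrative only:)

/**
 * walk_free_mem - call a visitor on each free memory block
 *
 * The callback runs in atomic context and must not sleep. The reported
 * blocks are only a snapshot: they may be reallocated at any time, so
 * the caller must treat them strictly as hints and must never touch the
 * underlying memory contents.
 */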
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-26 10:24                               ` Michal Hocko
@ 2017-07-26 11:44                                 ` Wei Wang
  2017-07-26 11:55                                   ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-26 11:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On 07/26/2017 06:24 PM, Michal Hocko wrote:
> On Wed 26-07-17 10:22:23, Wei Wang wrote:
>> On 07/25/2017 10:53 PM, Michal Hocko wrote:
>>> On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
> >>>> On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
>>>>> On Tue 25-07-17 19:56:24, Wei Wang wrote:
>>>>>> On 07/25/2017 07:25 PM, Michal Hocko wrote:
>>>>>>> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>>>>>>>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>>>>>>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>>>>>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>>>>>>>> [...
>>>>>> We don't need to do the pfn walk in the guest kernel. When the API
>>>>>> reports, for example, a 2MB free page block, the API caller offers to
>>>>>> the hypervisor the base address of the page block, and size=2MB, to
>>>>>> the hypervisor.
>>>>> So you want to skip pfn walks by regularly calling into the page allocator to
>>>>> update your bitmap. If that is the case then would an API that would allow you
> >>>>> to update your bitmap via a callback be sufficient? Something like
>>>>> 	void walk_free_mem(int node, int min_order,
>>>>> 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
>>>>>
>>>>> The function will call the given callback for each free memory block on the given
>>>>> node starting from the given min_order. The callback will be strictly an atomic
>>>>> and very light context. You can update your bitmap from there.
>>>> I would need to introduce more about the background here:
>>>> The hypervisor and the guest live in their own address space. The hypervisor's bitmap
>>>> isn't seen by the guest. I think we also wouldn't be able to give a callback function
>>>> from the hypervisor to the guest in this case.
>>> How did you plan to use your original API which exports a struct page array
>>> then?
>>
>> That's where the virtio-balloon driver comes in. It uses a shared ring
>> mechanism to send the guest memory info to the hypervisor.
>>
>> We didn't expose the struct page array from the guest to the hypervisor.
>> For example, when a 2MB free page block is reported from the free page
>> list, the info put on the ring is just (base address of the 2MB
>> contiguous memory, size=2M).
> So what exactly prevents virtio-balloon from using the above proposed
> callback mechanism and exporting what is needed to the hypervisor?

I thought about it more. Probably we can use the callback function with
a little change like this:

void walk_free_mem(void *opaque1, void (*visit)(void *opaque2,
		   unsigned long pfn, unsigned long nr_pages))
{
     ...
     for_each_populated_zone(zone) {
             for_each_migratetype_order(order, type) {
                     /* report_unused_page_block() is from patch 6 */
                     report_unused_page_block(zone, order, type, &page);
                     pfn = page_to_pfn(page);
                     visit(opaque1, pfn, 1 << order);
             }
     }
}

The above function scans all the free lists and directly sends each free
page block to the hypervisor via the virtio-balloon callback below. No
need to implement a bitmap.

In virtio-balloon, we have the callback:

void virtio_balloon_report_unused_pages(void *opaque, unsigned long pfn,
					unsigned long nr_pages)
{
     struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
     ...put the free page block on the ring of vb;
}
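
(As a usage sketch, assuming the names above, the driver side then boils
down to a single call:)

walk_free_mem(vb, virtio_balloon_report_unused_pages);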


What do you think?


Best,
Wei



* Re: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-26 11:44                                 ` Wei Wang
@ 2017-07-26 11:55                                   ` Michal Hocko
  2017-07-26 12:47                                     ` Wang, Wei W
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2017-07-26 11:55 UTC (permalink / raw)
  To: Wei Wang
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Wed 26-07-17 19:44:23, Wei Wang wrote:
[...]
> I thought about it more. Probably we can use the callback function with
> a little change like this:
> 
> void walk_free_mem(void *opaque1, void (*visit)(void *opaque2,
> 		   unsigned long pfn, unsigned long nr_pages))
> {
>      ...
>      for_each_populated_zone(zone) {
>              for_each_migratetype_order(order, type) {
>                      /* report_unused_page_block() is from patch 6 */
>                      report_unused_page_block(zone, order, type, &page);
>                      pfn = page_to_pfn(page);
>                      visit(opaque1, pfn, 1 << order);
>              }
>      }
> }
> 
> The above function scans all the free lists and directly sends each free
> page block to the hypervisor via the virtio-balloon callback below. No
> need to implement a bitmap.
> 
> In virtio-balloon, we have the callback:
> 
> void virtio_balloon_report_unused_pages(void *opaque, unsigned long pfn,
> 					unsigned long nr_pages)
> {
>      struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>      ...put the free page block on the ring of vb;
> }
> 
> 
> What do you think?

I do not mind conveying a context to the callback. I would still prefer
to keep the original min_order check semantics though. Why? Well,
it doesn't make much sense to scan low order free blocks all the time
because they are simply too volatile. Larger blocks tend to survive for
longer. So I assume you would only care about larger free blocks. This
will also make the call cheaper.
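
(For illustration, a sketch of the walker from earlier in the thread with
the min_order check kept, as suggested here:)

void walk_free_mem(int min_order, void *opaque,
		   void (*visit)(void *opaque, unsigned long pfn,
				 unsigned long nr_pages))
{
     ...
     for_each_populated_zone(zone) {
             for_each_migratetype_order(order, type) {
                     if (order < min_order)
                             continue;  /* low-order blocks are too volatile */
                     ...
             }
     }
}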
-- 
Michal Hocko
SUSE Labs


* RE: [PATCH v12 6/8] mm: support reporting free page blocks
  2017-07-26 11:55                                   ` Michal Hocko
@ 2017-07-26 12:47                                     ` Wang, Wei W
  0 siblings, 0 replies; 60+ messages in thread
From: Wang, Wei W @ 2017-07-26 12:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael S. Tsirkin, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, virtio-dev,
	yang.zhang.wz, quan.xu

On Wednesday, July 26, 2017 7:55 PM, Michal Hocko wrote:
> On Wed 26-07-17 19:44:23, Wei Wang wrote:
> [...]
> > I thought about it more. Probably we can use the callback function
> > with a little change like this:
> >
> > void walk_free_mem(void *opaque1, void (*visit)(void *opaque2,
> > 		   unsigned long pfn, unsigned long nr_pages))
> > {
> >      ...
> >      for_each_populated_zone(zone) {
> >              for_each_migratetype_order(order, type) {
> >                      /* report_unused_page_block() is from patch 6 */
> >                      report_unused_page_block(zone, order, type, &page);
> >                      pfn = page_to_pfn(page);
> >                      visit(opaque1, pfn, 1 << order);
> >              }
> >      }
> > }
> >
> > The above function scans all the free lists and directly sends each
> > free page block to the hypervisor via the virtio-balloon callback
> > below. No need to implement a bitmap.
> >
> > In virtio-balloon, we have the callback:
> >
> > void virtio_balloon_report_unused_pages(void *opaque, unsigned long pfn,
> > 					unsigned long nr_pages)
> > {
> >      struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> >      ...put the free page block on the ring of vb;
> > }
> >
> >
> > What do you think?
> 
> I do not mind conveying a context to the callback. I would still prefer
> to keep the original min_order check semantics though. Why? Well,
> it doesn't make much sense to scan low order free blocks all the time
> because they are simply too volatile. Larger blocks tend to survive for
> longer. So I assume you would only care about larger free blocks. This
> will also make the call cheaper.
> --

OK, I will keep min_order there in the next version.

Best,
Wei


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-26  3:48                 ` Wei Wang
@ 2017-07-26 17:02                   ` Michael S. Tsirkin
  2017-07-27  2:50                     ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-26 17:02 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote:
> On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote:
> > On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:
> > > On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
> > > > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
> > > > > > So the way I see it, there are several issues:
> > > > > > 
> > > > > > - internal wait - forces multiple APIs like kick/kick_sync
> > > > > >      note how kick_sync can fail but your code never checks return code
> > > > > > - need to re-write the last descriptor - might not work
> > > > > >      for alternative layouts which always expose descriptors
> > > > > >      immediately
> > > > > Probably it wasn't clear. Please let me explain the two functions here:
> > > > > 
> > > > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
> > > > > grabs a desc from the vq and inserts it to the chain tail (which is indexed
> > > > > by
> > > > > prev_id, probably better to call it tail_id). Then, the newly added desc
> > > > > becomes
> > > > > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
> > > > > when it's
> > > > > added to the chain, and set when another desc comes to follow later.
> > > > And this only works if there are multiple rings like
> > > > avail + descriptor ring.
> > > > It won't work e.g. with the proposed new layout where
> > > > writing out a descriptor exposes it immediately.
> > > I think it can support the 1.1 proposal, too. But before getting
> > > into that, I think we first need to deep dive into the implementation
> > > and usage of _first/next/last. The usage would need to lock the vq
> > > from the first to the end (otherwise, the returned info about the number
> > > of available desc in the vq, i.e. num_free, would be invalid):
> > > 
> > > lock(vq);
> > > add_first();
> > > add_next();
> > > add_last();
> > > unlock(vq);
> > > 
> > > However, I think the case isn't this simple, since we need to check more
> > > things
> > > after each add_xx() step. For example, if only one entry is available at the
> > > time
> > > we start to use the vq, that is, num_free is 0 after add_first(), we
> > > wouldn't be
> > > able to add_next and add_last. So, it would work like this:
> > > 
> > > start:
> > >      ...get free page block..
> > >      lock(vq)
> > > retry:
> > >      ret = add_first(..,&num_free,);
> > >      if(ret == -ENOSPC) {
> > >          goto retry;
> > >      } else if (!num_free) {
> > >          add_chain_head();
> > >          unlock(vq);
> > >          kick & wait;
> > >          goto start;
> > >      }
> > > next_one:
> > >      ...get free page block..
> > >      add_next(..,&num_free,);
> > >      if (!num_free) {
> > >          add_chain_head();
> > >          unlock(vq);
> > >          kick & wait;
> > >          goto start;
> > >      } if (num_free == 1) {
> > >          ...get free page block..
> > >          add_last(..);
> > >          unlock(vq);
> > >          kick & wait;
> > >          goto start;
> > >      } else {
> > >          goto next_one;
> > >      }
> > > 
> > > Given the above, it seems unnecessary to me to have three different APIs.
> > > That's the reason to combine them into one virtqueue_add_chain_desc().
> > > 
> > > -- or, do you have a different thought about using the three APIs?
> > > 
> > > 
> > > Implementation Reference:
> > > 
> > > struct desc_iterator {
> > >      unsigned int head;
> > >      unsigned int tail;
> > > };
> > > 
> > > add_first(*vq, *desc_iterator, *num_free, ..)
> > > {
> > >      if (vq->vq.num_free < 1)
> > >          return -ENOSPC;
> > >      get_desc(&desc_id);
> > >      desc[desc_id].flag &= ~_F_NEXT;
> > >      desc_iterator->head = desc_id;
> > >      desc_iterator->tail = desc_iterator->head;
> > >      *num_free = vq->vq.num_free;
> > > }
> > > 
> > > add_next(vq, desc_iterator, *num_free,..)
> > > {
> > >      get_desc(&desc_id);
> > >      desc[desc_id].flag &= ~_F_NEXT;
> > >      desc[desc_iterator->tail].next = desc_id;
> > >      desc[desc_iterator->tail].flag |= _F_NEXT;
> > >      desc_iterator->tail = desc_id;
> > >      *num_free = vq->vq.num_free;
> > > }
> > > 
> > > add_last(vq, desc_iterator,..)
> > > {
> > >      get_desc(&desc_id);
> > >      desc[desc_id].flag &= ~_F_NEXT;
> > >      desc[desc_iterator->tail].next = desc_id;
> > >      desc_iterator->tail = desc_id;
> > > 
> > >      add_chain_head(); // put the desc_iterator.head to the ring
> > > }
> > > 
> > > 
> > > Best,
> > > Wei
> > OK I thought this over. While we might need these new APIs in
> > the future, I think that at the moment, there's a way to implement
> > this feature that is significantly simpler. Just add each s/g
> > as a separate input buffer.
> 
> 
> Should it be an output buffer?

Hypervisor overwrites these pages with zeroes. Therefore it is
writeable by device: DMA_FROM_DEVICE.

> I think output means from the
> driver to device (i.e. DMA_TO_DEVICE).

This part is correct I believe.

> > 
> > This needs zero new APIs.
> > 
> > I know that follow-up patches need to add a header in front
> > so you might be thinking: how am I going to add this
> > header? The answer is quite simple - add it as a separate
> > out header.
> > 
> > Host will be able to distinguish between header and pages
> > by looking at the direction, and - should we want to add
> > IN data to header - additionally size (<4K => header).
> 
> 
> I think this works fine when the cmdq is only used for
> reporting the unused pages.
> It would be an issue
> if there are other usages (e.g. report memory statistics)
> interleaving. I think one solution would be to lock the cmdq until
> a cmd usage is done (e.g. all the unused pages are reported) -
> in this case, the periodically updated guest memory statistics
> may be delayed for a while occasionally when live migration starts.
> Would this be acceptable? If not, probably we can have the cmdq
> for one usage only.
> 
> 
> Best,
> Wei

OK I see, I think the issue is that reporting free pages
was structured like stats. Let's split it -
send pages on e.g. free_vq, get commands on vq shared with
stats.


-- 
MST


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-26 17:02                   ` Michael S. Tsirkin
@ 2017-07-27  2:50                     ` Wei Wang
  2017-07-28 23:08                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-27  2:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/27/2017 01:02 AM, Michael S. Tsirkin wrote:
> On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote:
>> On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote:
>>> On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote:
>>>> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote:
>>>>>> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote:
>>>>>>> So the way I see it, there are several issues:
>>>>>>>
>>>>>>> - internal wait - forces multiple APIs like kick/kick_sync
>>>>>>>       note how kick_sync can fail but your code never checks return code
>>>>>>> - need to re-write the last descriptor - might not work
>>>>>>>       for alternative layouts which always expose descriptors
>>>>>>>       immediately
>>>>>> Probably it wasn't clear. Please let me explain the two functions here:
>>>>>>
>>>>>> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..):
>>>>>> grabs a desc from the vq and inserts it to the chain tail (which is indexed
>>>>>> by
>>>>>> prev_id, probably better to call it tail_id). Then, the newly added desc
>>>>>> becomes
>>>>>> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc
>>>>>> when it's
>>>>>> added to the chain, and set when another desc comes to follow later.
>>>>> And this only works if there are multiple rings like
>>>>> avail + descriptor ring.
>>>>> It won't work e.g. with the proposed new layout where
>>>>> writing out a descriptor exposes it immediately.
>>>> I think it can support the 1.1 proposal, too. But before getting
>>>> into that, I think we first need to deep dive into the implementation
>>>> and usage of _first/next/last. The usage would need to lock the vq
>>>> from the first to the end (otherwise, the returned info about the number
>>>> of available desc in the vq, i.e. num_free, would be invalid):
>>>>
>>>> lock(vq);
>>>> add_first();
>>>> add_next();
>>>> add_last();
>>>> unlock(vq);
>>>>
>>>> However, I think the case isn't this simple, since we need to check more
>>>> things
>>>> after each add_xx() step. For example, if only one entry is available at the
>>>> time
>>>> we start to use the vq, that is, num_free is 0 after add_first(), we
>>>> wouldn't be
>>>> able to add_next and add_last. So, it would work like this:
>>>>
>>>> start:
>>>>       ...get free page block..
>>>>       lock(vq)
>>>> retry:
>>>>       ret = add_first(..,&num_free,);
>>>>       if(ret == -ENOSPC) {
>>>>           goto retry;
>>>>       } else if (!num_free) {
>>>>           add_chain_head();
>>>>           unlock(vq);
>>>>           kick & wait;
>>>>           goto start;
>>>>       }
>>>> next_one:
>>>>       ...get free page block..
>>>>       add_next(..,&num_free,);
>>>>       if (!num_free) {
>>>>           add_chain_head();
>>>>           unlock(vq);
>>>>           kick & wait;
>>>>           goto start;
>>>>       } if (num_free == 1) {
>>>>           ...get free page block..
>>>>           add_last(..);
>>>>           unlock(vq);
>>>>           kick & wait;
>>>>           goto start;
>>>>       } else {
>>>>           goto next_one;
>>>>       }
>>>>
>>>> Given the above, it seems unnecessary to me to have three different APIs.
>>>> That's the reason to combine them into one virtqueue_add_chain_desc().
>>>>
>>>> -- or, do you have a different thought about using the three APIs?
>>>>
>>>>
>>>> Implementation Reference:
>>>>
>>>> struct desc_iterator {
>>>>       unsigned int head;
>>>>       unsigned int tail;
>>>> };
>>>>
>>>> add_first(*vq, *desc_iterator, *num_free, ..)
>>>> {
>>>>       if (vq->vq.num_free < 1)
>>>>           return -ENOSPC;
>>>>       get_desc(&desc_id);
>>>>       desc[desc_id].flag &= ~_F_NEXT;
>>>>       desc_iterator->head = desc_id;
>>>>       desc_iterator->tail = desc_iterator->head;
>>>>       *num_free = vq->vq.num_free;
>>>> }
>>>>
>>>> add_next(vq, desc_iterator, *num_free,..)
>>>> {
>>>>       get_desc(&desc_id);
>>>>       desc[desc_id].flag &= ~_F_NEXT;
>>>>       desc[desc_iterator->tail].next = desc_id;
>>>>       desc[desc_iterator->tail].flag |= _F_NEXT;
>>>>       desc_iterator->tail = desc_id;
>>>>       *num_free = vq->vq.num_free;
>>>> }
>>>>
>>>> add_last(vq, desc_iterator,..)
>>>> {
>>>>       get_desc(&desc_id);
>>>>       desc[desc_id].flag &= ~_F_NEXT;
>>>>       desc[desc_iterator->tail].next = desc_id;
>>>>       desc_iterator->tail = desc_id;
>>>>
>>>>       add_chain_head(); // put the desc_iterator.head to the ring
>>>> }
>>>>
>>>>
>>>> Best,
>>>> Wei
>>> OK I thought this over. While we might need these new APIs in
>>> the future, I think that at the moment, there's a way to implement
>>> this feature that is significantly simpler. Just add each s/g
>>> as a separate input buffer.
>>
>> Should it be an output buffer?
> Hypervisor overwrites these pages with zeroes. Therefore it is
> writeable by device: DMA_FROM_DEVICE.

Why would the hypervisor need to zero the buffer? I think it may only
need to read out the info(base,size).

I think it should be like this:
the cmd hdr buffer: input, because the hypervisor would write it to
send a cmd to the guest
the payload buffer: output, for the hypervisor to read the info
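
(A minimal sketch of that layout; the buffer and field names here are
assumptions, not the actual patch:)

struct scatterlist hdr_sg, payload_sg;

/* cmd hdr: written by the device, so add it as an IN buffer */
sg_init_one(&hdr_sg, &vb->cmd_hdr, sizeof(vb->cmd_hdr));
virtqueue_add_inbuf(vq, &hdr_sg, 1, vb, GFP_KERNEL);

/* payload: only read by the device, so add it as an OUT buffer */
sg_init_one(&payload_sg, payload, payload_len);
virtqueue_add_outbuf(vq, &payload_sg, 1, vb, GFP_KERNEL);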

>> I think output means from the
>> driver to device (i.e. DMA_TO_DEVICE).
> This part is correct I believe.
>
>>> This needs zero new APIs.
>>>
>>> I know that follow-up patches need to add a header in front
>>> so you might be thinking: how am I going to add this
>>> header? The answer is quite simple - add it as a separate
>>> out header.
>>>
>>> Host will be able to distinguish between header and pages
>>> by looking at the direction, and - should we want to add
>>> IN data to header - additionally size (<4K => header).
>>
>> I think this works fine when the cmdq is only used for
>> reporting the unused pages.
>> It would be an issue
>> if there are other usages (e.g. report memory statistics)
>> interleaving. I think one solution would be to lock the cmdq until
> > a cmd usage is done (e.g. all the unused pages are reported) -
>> in this case, the periodically updated guest memory statistics
>> may be delayed for a while occasionally when live migration starts.
>> Would this be acceptable? If not, probably we can have the cmdq
>> for one usage only.
>>
>>
>> Best,
>> Wei
> OK I see, I think the issue is that reporting free pages
> was structured like stats. Let's split it -
> send pages on e.g. free_vq, get commands on vq shared with
> stats.
>

Would it be better to have the "report free page" command sent
through the free_vq? In this case, we will have
stats_vq: for the stats usage, which is already there
free_vq: for reporting free pages.
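
(A sketch of the vq setup under that split; the "free_pages" name and the
free_page_request callback are assumptions, the rest follows the existing
init_vqs() style:)

struct virtqueue *vqs[4];
vq_callback_t *callbacks[] = { balloon_ack, balloon_ack,
			       stats_request, free_page_request };
const char * const names[] = { "inflate", "deflate", "stats", "free_pages" };

err = virtio_find_vqs(vb->vdev, 4, vqs, callbacks, names, NULL);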

Best,
Wei







* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
                     ` (3 preceding siblings ...)
  2017-07-13  4:21   ` kbuild test robot
@ 2017-07-28  8:25   ` Wei Wang
  2017-07-28 23:01     ` Michael S. Tsirkin
  4 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-28  8:25 UTC (permalink / raw)
  To: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	david, cornelia.huck, akpm, mgorman, aarcange, amit.shah,
	pbonzini, liliang.opensource, mhocko, willy
  Cc: virtio-dev, yang.zhang.wz, quan.xu

On 07/12/2017 08:40 PM, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the guest to
> transfer a chunk of ballooned (i.e. inflated/deflated) pages using
> scatter-gather lists to the host.
>
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
>
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
>
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
>
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically contiguous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
>
> With this new feature, the above ballooning process takes ~491ms
> resulting in an improvement of ~88%.
>


I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499,
which zeros all the ballooned pages and is very time consuming.

Tests show that the time to balloon 7G pages is increased from ~491 ms to
2.8 seconds with the above patch.

How about moving the zero operation to the hypervisor? In this way, we
will have a much faster balloon process.


Best,
Wei



* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-28  8:25   ` Wei Wang
@ 2017-07-28 23:01     ` Michael S. Tsirkin
  0 siblings, 0 replies; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-28 23:01 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, mhocko, willy, virtio-dev, yang.zhang.wz,
	quan.xu

On Fri, Jul 28, 2017 at 04:25:19PM +0800, Wei Wang wrote:
> On 07/12/2017 08:40 PM, Wei Wang wrote:
> > Add a new feature, VIRTIO_BALLOON_F_SG, which enables the guest to
> > transfer a chunk of ballooned (i.e. inflated/deflated) pages using
> > scatter-gather lists to the host.
> > 
> > The implementation of the previous virtio-balloon is not very
> > efficient, because the balloon pages are transferred to the
> > host one by one. Here is the breakdown of the time in percentage
> > spent on each step of the balloon inflating process (inflating
> > 7GB of an 8GB idle guest).
> > 
> > 1) allocating pages (6.5%)
> > 2) sending PFNs to host (68.3%)
> > 3) address translation (6.1%)
> > 4) madvise (19%)
> > 
> > It takes about 4126ms for the inflating process to complete.
> > The above profiling shows that the bottlenecks are stage 2)
> > and stage 4).
> > 
> > This patch optimizes step 2) by transferring pages to the host in
> > sgs. An sg describes a chunk of guest physically contiguous pages.
> > With this mechanism, step 4) can also be optimized by doing address
> > translation and madvise() in chunks rather than page by page.
> > 
> > With this new feature, the above ballooning process takes ~491ms
> > resulting in an improvement of ~88%.
> > 
> 
> 
> I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499,
> which zeros all the ballooned pages and is very time consuming.
> 
> Tests show that the time to balloon 7G pages is increased from ~491 ms to
> 2.8 seconds with the above patch.

Sounds like it should be reverted. Post a revert pls and
we'll discuss.

> How about moving the zero operation to the hypervisor? In this way, we
> will have a much faster balloon process.
> 
> 
> Best,
> Wei

Or in other words, hypervisors should not be stupid and
should not try to run ksm on DONTNEED pages.
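
(For context, the hypervisor-side discard being discussed is essentially
the following userspace call, a sketch of what QEMU does for ballooned
pages, with hva/size standing in for the translated guest block:)

/* drop the backing memory; a later access reads back zeroes */
madvise(hva, size, MADV_DONTNEED);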

-- 
MST


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-27  2:50                     ` Wei Wang
@ 2017-07-28 23:08                       ` Michael S. Tsirkin
  2017-07-29 12:47                         ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-28 23:08 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > OK I thought this over. While we might need these new APIs in
> > > > the future, I think that at the moment, there's a way to implement
> > > > this feature that is significantly simpler. Just add each s/g
> > > > as a separate input buffer.
> > > 
> > > Should it be an output buffer?
> > Hypervisor overwrites these pages with zeroes. Therefore it is
> > writeable by device: DMA_FROM_DEVICE.
> 
> Why would the hypervisor need to zero the buffer?

The page is supplied to hypervisor and can lose the value that
is there.  That is the definition of writeable by device.

> I think it may only
> need to read out the info(base,size).

And then do what?

> I think it should be like this:
> the cmd hdr buffer: input, because the hypervisor would write it to
> send a cmd to the guest
> the payload buffer: output, for the hypervisor to read the info

These should be split.

We have:

1. request that hypervisor sends to guest, includes ID: input
2. synchronisation header with ID sent by guest: output
3. list of pages: input

2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ
might make sense.
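
(A sketch of how items 2 and 3 could be posted together on the free page
vq; struct and field names here are assumptions:)

struct scatterlist hdr_sg, pages_sg;

/* 2. synchronisation header with the request ID: driver -> device */
sg_init_one(&hdr_sg, &vb->sync_hdr, sizeof(vb->sync_hdr));
virtqueue_add_outbuf(vq, &hdr_sg, 1, vb, GFP_KERNEL);

/* 3. the free page blocks themselves: the device may write/discard them */
sg_init_one(&pages_sg, page_address(page), nr_pages << PAGE_SHIFT);
virtqueue_add_inbuf(vq, &pages_sg, 1, vb, GFP_KERNEL);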


> > > I think output means from the
> > > driver to device (i.e. DMA_TO_DEVICE).
> > This part is correct I believe.
> > 
> > > > This needs zero new APIs.
> > > > 
> > > > I know that follow-up patches need to add a header in front
> > > > so you might be thinking: how am I going to add this
> > > > header? The answer is quite simple - add it as a separate
> > > > out header.
> > > > 
> > > > Host will be able to distinguish between header and pages
> > > > by looking at the direction, and - should we want to add
> > > > IN data to header - additionally size (<4K => header).
> > > 
> > > I think this works fine when the cmdq is only used for
> > > reporting the unused pages.
> > > It would be an issue
> > > if there are other usages (e.g. report memory statistics)
> > > interleaving. I think one solution would be to lock the cmdq until
> > > a cmd usage is done (e.g. all the unused pages are reported) -
> > > in this case, the periodically updated guest memory statistics
> > > may be delayed for a while occasionally when live migration starts.
> > > Would this be acceptable? If not, probably we can have the cmdq
> > > for one usage only.
> > > 
> > > 
> > > Best,
> > > Wei
> > OK I see, I think the issue is that reporting free pages
> > was structured like stats. Let's split it -
> > send pages on e.g. free_vq, get commands on vq shared with
> > stats.
> > 
> 
> Would it be better to have the "report free page" command sent
> through the free_vq? In this case, we will have
> stats_vq: for the stats usage, which is already there
> free_vq: for reporting free pages.
> 
> Best,
> Wei

See above. I would get requests on stats vq but report
free pages separately on free vq.

> 
> 
> 
> 


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-28 23:08                       ` Michael S. Tsirkin
@ 2017-07-29 12:47                         ` Wei Wang
  2017-07-30  4:22                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wei Wang @ 2017-07-29 12:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
>>>>> OK I thought this over. While we might need these new APIs in
>>>>> the future, I think that at the moment, there's a way to implement
>>>>> this feature that is significantly simpler. Just add each s/g
>>>>> as a separate input buffer.
>>>> Should it be an output buffer?
>>> Hypervisor overwrites these pages with zeroes. Therefore it is
>>> writeable by device: DMA_FROM_DEVICE.
>> Why would the hypervisor need to zero the buffer?
> The page is supplied to hypervisor and can lose the value that
> is there.  That is the definition of writeable by device.

I think for the free pages, it should be clear that they will be added as
output buffer to the device, because (as we discussed) they are just hints,
and some of them may be used by the guest after the report_ API is invoked.
The device/hypervisor should not use or discard them.

For the balloon pages, I kind of agree with the existing implementation
(e.g. inside tell_host()), which uses virtqueue_add_outbuf (instead of
_add_inbuf()). I think an inbuf should be a buffer which will be written
by the device and read by the driver. The cmd buffer put on the vq for
the device to send commands can be an inbuf, I think.

>
>> I think it may only
>> need to read out the info(base,size).
> And then do what?


For the free pages, the info will be used to clear the corresponding "1"
in the dirty bitmap. For balloon pages, they will be made DONTNEED and
given to other host processes to use (the device won't write them, so no
need to set "write" when calling virtqueue_map_desc() in the device).


>
>> I think it should be like this:
>> the cmd hdr buffer: input, because the hypervisor would write it to
>> send a cmd to the guest
>> the payload buffer: output, for the hypervisor to read the info
> These should be split.
>
> We have:
>
> 1. request that hypervisor sends to guest, includes ID: input
> 2. synchronisation header with ID sent by guest: output
> 3. list of pages: input
>
> 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ
> might make sense.

I would prefer to make the above item 1 come on the free page vq,
because the existing stat_vq doesn't support the cmd hdr. Now, I think
it is also not necessary to move the existing stat_vq implementation to
a new implementation under the cmd hdr, because that new support doesn't
make a difference for stats: it will still use its stat_vq (the free
page vq will be used for reporting free pages only).


Best,
Wei


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-29 12:47                         ` Wei Wang
@ 2017-07-30  4:22                           ` Michael S. Tsirkin
  2017-07-30  5:59                             ` Wang, Wei W
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-30  4:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > OK I thought this over. While we might need these new APIs in
> > > > > > the future, I think that at the moment, there's a way to implement
> > > > > > this feature that is significantly simpler. Just add each s/g
> > > > > > as a separate input buffer.
> > > > > Should it be an output buffer?
> > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > writeable by device: DMA_FROM_DEVICE.
> > > Why would the hypervisor need to zero the buffer?
> > The page is supplied to hypervisor and can lose the value that
> > is there.  That is the definition of writeable by device.
> 
> I think for the free pages, it should be clear that they will be added as
> output buffer to the device, because (as we discussed) they are just hints,
> and some of them may be used by the guest after the report_ API is invoked.
> The device/hypervisor should not use or discard them.

Discarding contents is exactly what you propose doing if
migration is going on, isn't it?

> For the balloon pages, I kind of agree with the existing implementation
> (e.g. inside tell_host()), which uses virtqueue_add_outbuf (instead of
> _add_inbuf()).


This is because it does not pass SGs, it passes weirdly
formatted PA within the buffer.

> I think inbuf should be a buffer which will be written by the device and
> read by the
> driver.

If hypervisor can change it, it's an inbuf. Should not matter
whether driver reads it.

> The cmd buffer put on the vq for the device to send commands can be
> an
> inbuf, I think.
> 
> 
> > 
> > > I think it may only
> > > need to read out the info(base,size).
> > And then do what?
> 
> 
> For the free pages, the info will be used to clear the corresponding "1" in
> the dirty bitmap.
> For balloon pages, they will be made DONTNEED and given to other host
> processes to
> use (the device won't write them, so no need to set "write" when
> virtqueue_map_desc() in
> the device).
> 
> 
> > 
> > > I think it should be like this:
> > > the cmd hdr buffer: input, because the hypervisor would write it to
> > > send a cmd to the guest
> > > the payload buffer: output, for the hypervisor to read the info
> > These should be split.
> > 
> > We have:
> > 
> > 1. request that hypervisor sends to guest, includes ID: input
> > 2. synchronisation header with ID sent by guest: output
> > 3. list of pages: input
> > 
> > 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ
> > might make sense.
> 
> I would prefer to make the above item 1 come on the free page vq,
> because the existing stat_vq doesn't support the cmd hdr.
> Now, I think it is also not necessary to move the existing stat_vq
> implementation to
> a new implementation under the cmd hdr, because
> that new support doesn't make a difference for stats, it will still use its
> stat_vq (the free
> page vq will be used for reporting free pages only)
> 
> What do you think?
> 
> 
> Best,
> Wei


* RE: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-30  4:22                           ` Michael S. Tsirkin
@ 2017-07-30  5:59                             ` Wang, Wei W
  2017-07-30 16:18                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Wang, Wei W @ 2017-07-30  5:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > add each s/g as a separate input buffer.
> > > > > > Should it be an output buffer?
> > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > writeable by device: DMA_FROM_DEVICE.
> > > > Why would the hypervisor need to zero the buffer?
> > > The page is supplied to hypervisor and can lose the value that is
> > > there.  That is the definition of writeable by device.
> >
> > I think for the free pages, it should be clear that they will be added
> > as output buffer to the device, because (as we discussed) they are
> > just hints, and some of them may be used by the guest after the report_ API is
> invoked.
> > The device/hypervisor should not use or discard them.
> 
> Discarding contents is exactly what you propose doing if migration is going on,
> isn't it?

That's actually a different concept. Please let me explain it with this example:

The hypervisor receives the hint saying the guest PageX is a free page, but as we know, 
after that report_ API exits, the guest kernel may take PageX to use, so PageX is no longer
a free page. At this time, if the hypervisor writes to the page, that would crash the guest.
So, I think the cornerstone of this work is that the hypervisor should not touch the
reported pages.

Best,
Wei    


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-30  5:59                             ` Wang, Wei W
@ 2017-07-30 16:18                               ` Michael S. Tsirkin
  2017-07-30 16:20                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-30 16:18 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote:
> On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > > add each s/g as a separate input buffer.
> > > > > > > Should it be an output buffer?
> > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > > writeable by device: DMA_FROM_DEVICE.
> > > > > Why would the hypervisor need to zero the buffer?
> > > > The page is supplied to hypervisor and can lose the value that is
> > > > there.  That is the definition of writeable by device.
> > >
> > > I think for the free pages, it should be clear that they will be added
> > > as output buffer to the device, because (as we discussed) they are
> > > just hints, and some of them may be used by the guest after the report_ API is
> > invoked.
> > > The device/hypervisor should not use or discard them.
> > 
> > Discarding contents is exactly what you propose doing if migration is going on,
> > isn't it?
> 
> That's actually a different concept. Please let me explain it with this example:
> 
> The hypervisor receives the hint saying the guest PageX is a free page, but as we know, 
> > after that report_ API exits, the guest kernel may take PageX to use, so PageX is no longer
> > a free page. At this time, if the hypervisor writes to the page, that would crash the guest.
> So, I think the cornerstone of this work is that the hypervisor should not touch the
> reported pages.
> 
> Best,
> Wei    

That's a hypervisor implementation detail. From the guest's point of view,
discarding contents cannot be distinguished from writing old contents.


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-30 16:18                               ` Michael S. Tsirkin
@ 2017-07-30 16:20                                 ` Michael S. Tsirkin
  2017-07-31 12:36                                   ` Wei Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Michael S. Tsirkin @ 2017-07-30 16:20 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote:
> On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote:
> > On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> > > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > > > add each s/g as a separate input buffer.
> > > > > > > > Should it be an output buffer?
> > > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > > > writeable by device: DMA_FROM_DEVICE.
> > > > > > Why would the hypervisor need to zero the buffer?
> > > > > The page is supplied to hypervisor and can lose the value that is
> > > > > there.  That is the definition of writeable by device.
> > > >
> > > > I think for the free pages, it should be clear that they will be added
> > > > as output buffer to the device, because (as we discussed) they are
> > > > just hints, and some of them may be used by the guest after the report_ API is
> > > invoked.
> > > > The device/hypervisor should not use or discard them.
> > > 
> > > Discarding contents is exactly what you propose doing if migration is going on,
> > > isn't it?
> > 
> > That's actually a different concept. Please let me explain it with this example:
> > 
> > The hypervisor receives the hint saying the guest PageX is a free page, but as we know, 
> > after that report_ API exits, the guest kernel may take PageX to use, so PageX is no longer
> > a free page. At this time, if the hypervisor writes to the page, that would crash the guest.
> > So, I think the cornerstone of this work is that the hypervisor should not touch the
> > reported pages.
> > 
> > Best,
> > Wei    
> 
> That's a hypervisor implementation detail. From the guest's point of view,
> discarding contents cannot be distinguished from writing old contents.
> 

Besides, ignoring the free page tricks, consider regular ballooning.
We map a page with DONTNEED then back with WILLNEED. The result is
a zero page. So at least one of deflate/inflate should be input.
I'd say both for symmetry.
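
(In driver terms, a minimal sketch, not the current tell_host()
implementation: treating an inflated page as device-writeable would look
like this:)

struct scatterlist sg;

sg_init_one(&sg, page_address(page), PAGE_SIZE);
/* the host may discard the contents, so this must be an IN buffer */
virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
virtqueue_kick(vq);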

-- 
MST


* Re: [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-07-30 16:20                                 ` Michael S. Tsirkin
@ 2017-07-31 12:36                                   ` Wei Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wei Wang @ 2017-07-31 12:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, virtio-dev, yang.zhang.wz, quan.xu

On 07/31/2017 12:20 AM, Michael S. Tsirkin wrote:
> On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote:
>> On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote:
>> That's a hypervisor implementation detail. From the guest's point of view,
>> discarding contents cannot be distinguished from writing old contents.
>>
> Besides, ignoring the free page tricks, consider regular ballooning.
> We map a page with DONTNEED then back with WILLNEED. The result is
> a zero page. So at least one of deflate/inflate should be input.
> I'd say both for symmetry.
>

OK, I see the point. Thanks.

Best,
Wei
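
For reference, the zero-page behaviour described above comes from the host
side dropping the page backing with madvise(). A minimal host-side sketch,
assuming the ballooned guest page is anonymous private memory in the
hypervisor process (host_discard_guest_page() and hva are hypothetical
names used for illustration):

#include <stddef.h>
#include <sys/mman.h>

/*
 * On inflate the hypervisor drops the backing of the guest page with
 * MADV_DONTNEED; the next access (e.g. after deflate) faults in a fresh
 * zero-filled page, so the old contents are gone.  This is why the guest
 * must treat the page as device-writable.
 */
static void host_discard_guest_page(void *hva, size_t page_size)
{
	madvise(hva, page_size, MADV_DONTNEED);
}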


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2017-07-31 12:33 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-12 12:40 [PATCH v12 0/8] Virtio-balloon Enhancement Wei Wang
2017-07-12 12:40 ` [PATCH v12 1/8] virtio-balloon: deflate via a page list Wei Wang
2017-07-12 12:40 ` [PATCH v12 2/8] virtio-balloon: coding format cleanup Wei Wang
2017-07-12 12:40 ` [PATCH v12 3/8] Introduce xbitmap Wei Wang
2017-07-12 12:40 ` [PATCH v12 4/8] xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
2017-07-12 12:40 ` [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
2017-07-12 13:06   ` Michael S. Tsirkin
2017-07-12 13:29     ` Wei Wang
2017-07-12 13:56       ` Michael S. Tsirkin
2017-07-13  7:42         ` Wei Wang
2017-07-13 20:19           ` Michael S. Tsirkin
2017-07-14  7:12             ` Wei Wang
2017-07-23  1:45               ` Michael S. Tsirkin
2017-07-26  3:48                 ` Wei Wang
2017-07-26 17:02                   ` Michael S. Tsirkin
2017-07-27  2:50                     ` Wei Wang
2017-07-28 23:08                       ` Michael S. Tsirkin
2017-07-29 12:47                         ` Wei Wang
2017-07-30  4:22                           ` Michael S. Tsirkin
2017-07-30  5:59                             ` Wang, Wei W
2017-07-30 16:18                               ` Michael S. Tsirkin
2017-07-30 16:20                                 ` Michael S. Tsirkin
2017-07-31 12:36                                   ` Wei Wang
2017-07-13  0:44   ` Michael S. Tsirkin
2017-07-13  1:16   ` kbuild test robot
2017-07-13  4:21   ` kbuild test robot
2017-07-28  8:25   ` Wei Wang
2017-07-28 23:01     ` Michael S. Tsirkin
2017-07-12 12:40 ` [PATCH v12 6/8] mm: support reporting free page blocks Wei Wang
2017-07-13  0:33   ` Michael S. Tsirkin
2017-07-13  8:25     ` Wei Wang
2017-07-14 12:30   ` Michal Hocko
2017-07-14 12:54     ` Michal Hocko
2017-07-14 15:46       ` Michael S. Tsirkin
2017-07-14 19:17     ` Michael S. Tsirkin
2017-07-17 15:24       ` Michal Hocko
2017-07-18  2:12         ` Wei Wang
2017-07-19  8:13           ` Michal Hocko
2017-07-19 12:01             ` Wei Wang
2017-07-24  9:00               ` Michal Hocko
2017-07-25  9:32                 ` Wei Wang
2017-07-25 11:25                   ` Michal Hocko
2017-07-25 11:56                     ` Wei Wang
2017-07-25 12:41                       ` Michal Hocko
2017-07-25 14:47                         ` Wang, Wei W
2017-07-25 14:53                           ` Michal Hocko
2017-07-26  2:22                             ` Wei Wang
2017-07-26 10:24                               ` Michal Hocko
2017-07-26 11:44                                 ` Wei Wang
2017-07-26 11:55                                   ` Michal Hocko
2017-07-26 12:47                                     ` Wang, Wei W
2017-07-12 12:40 ` [PATCH v12 7/8] mm: export symbol of next_zone and first_online_pgdat Wei Wang
2017-07-13  0:16   ` Michael S. Tsirkin
2017-07-13  8:41     ` [virtio-dev] " Wei Wang
2017-07-14 12:31   ` Michal Hocko
2017-07-12 12:40 ` [PATCH v12 8/8] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ Wei Wang
2017-07-13  0:22   ` Michael S. Tsirkin
2017-07-13  8:46     ` Wei Wang
2017-07-13 17:59       ` Michael S. Tsirkin
2017-07-13  0:14 ` [PATCH v12 0/8] Virtio-balloon Enhancement Michael S. Tsirkin
