linux-mm.kvack.org archive mirror
* [PATCH v15 0/5] Virtio-balloon Enhancement
@ 2017-08-28 10:08 Wei Wang
  2017-08-28 10:08 ` [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Wei Wang @ 2017-08-28 10:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one by one; and
2) free page block reporting: a new virtqueue to report guest free pages
to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. In the 1st round, all the VM's memory
is transferred. From the 2nd round onward, only the pieces of memory that
were written by the guest after the previous round are transferred. One
method commonly used by the hypervisor to track which parts of memory are
written is to write-protect all the guest memory.

The second feature enables an optimization of the 1st-round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It does not matter if those pages are used by the guest after
they are reported as free, because any reported page that is later written
will be caught by the hypervisor's dirty tracking and transferred in a
subsequent round.

Change Log:
v14->v15:
1) mm: make the report callback return a bool value - returning true stops
the walk through the free page list.
2) virtio-balloon: batch the sgs of balloon pages until the vq is full.
3) virtio-balloon: create a new workqueue, rather than using the default
system_wq, to queue the free page reporting work item.
4) virtio-balloon: add a ctrl_vq to act as a central control plane, which
will handle all future control-related commands between the host and guest.
Free page reporting is added as the first feature controlled under the
ctrl_vq, and the free_page_vq is a data plane vq dedicated to the
transmission of free page blocks.

v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_set/clear/test_bit into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page block to the device, if the
vq is full (less likely to happen in practice), just skip reporting this
block instead of busy-waiting until an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the free page blocks from the
report function. This avoids exposing zone internals to a kernel
module.
2) virtio-balloon: send balloon pages or a free page block using a single
sg each time. This has the benefits of simpler implementation with no new
APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
interleaving of multiple usages).
4) virtio-balloon: Balloon pages and free page blocks are sent via input
sgs, and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring
desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE()
macro to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages
blocks directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support adding an indirect desc table to a virtqueue;
3) virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) Re-write the virtio-balloon implementation patch.
3) Commit message changes.
4) Patch re-organization.


Matthew Wilcox (1):
  lib/xbitmap: Introduce xbitmap

Wei Wang (4):
  lib/xbitmap: add xb_find_next_bit() and xb_zero()
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

 drivers/virtio/virtio_balloon.c     | 418 ++++++++++++++++++++++++++++++++----
 include/linux/mm.h                  |   5 +
 include/linux/radix-tree.h          |   3 +
 include/linux/xbitmap.h             |  64 ++++++
 include/uapi/linux/virtio_balloon.h |  16 ++
 lib/Makefile                        |   2 +-
 lib/radix-tree.c                    |  22 +-
 lib/xbitmap.c                       | 215 +++++++++++++++++++
 mm/page_alloc.c                     |  65 ++++++
 9 files changed, 769 insertions(+), 41 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap
  2017-08-28 10:08 [PATCH v15 0/5] Virtio-balloon Enhancement Wei Wang
@ 2017-08-28 10:08 ` Wei Wang
  2017-09-11 12:54   ` Matthew Wilcox
  2017-08-28 10:08 ` [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2017-08-28 10:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
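
For illustration, a minimal usage sketch of the API added below (the bit
index 82734 is arbitrary):

	struct xb xb;
	int err;

	xb_init(&xb);

	xb_preload(GFP_KERNEL);		/* preallocate nodes, disables preemption */
	err = xb_set_bit(&xb, 82734);	/* 0 on success, -EAGAIN otherwise */
	xb_preload_end();		/* re-enables preemption */

	if (!err && xb_test_bit(&xb, 82734))
		xb_clear_bit(&xb, 82734);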

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/radix-tree.h |   3 +
 include/linux/xbitmap.h    |  61 ++++++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  22 +++++-
 lib/xbitmap.c              | 176 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..e1203b1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
@@ -325,6 +327,7 @@ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
 unsigned int radix_tree_gang_lookup_slot(const struct radix_tree_root *,
 			void __rcu ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..25b05ff
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,61 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..ee72e2c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
@@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 out:
 	return ret;
 }
+EXPORT_SYMBOL(__radix_tree_preload);
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
@@ -840,6 +841,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +1989,20 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+/**
+ * __radix_tree_delete - delete a slot from a radix tree
+ * @root: radix tree root
+ * @node: node containing the slot
+ * @slot: pointer to the slot to delete
+ *
+ * Clear @slot from @node of the radix tree. This may cause the current node to
+ * be freed. This function may be called without any locking if there are no
+ * other threads which can access this tree.
+ *
+ * Return: true if the node containing @slot was also freed, false otherwise.
+ */
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 	replace_slot(slot, NULL, node, -1, exceptional);
 	return node && delete_node(root, node, NULL, NULL);
 }
+EXPORT_SYMBOL(__radix_tree_delete);
 
 /**
  * radix_tree_iter_delete - delete the entry at this iterator position
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..8c55296
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,176 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding item (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+enum xb_ops {
+	XB_SET,
+	XB_CLEAR,
+	XB_TEST
+};
+
+static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
+{
+	int ret = 0;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit, tmp;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
+
+	switch (ops) {
+	case XB_SET:
+		ret = __radix_tree_create(root, index, 0, &node, &slot);
+		if (ret)
+			return ret;
+		bitmap = rcu_dereference_raw(*slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit < BITS_PER_LONG) {
+				tmp |= 1UL << ebit;
+				rcu_assign_pointer(*slot, (void *)tmp);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			bitmap->bitmap[0] =
+					tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+			rcu_assign_pointer(*slot, bitmap);
+		}
+		if (!bitmap) {
+			if (ebit < BITS_PER_LONG) {
+				bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+				__radix_tree_replace(root, node, slot, bitmap,
+						     NULL, NULL);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+					     NULL);
+		}
+		__set_bit(bit, bitmap->bitmap);
+		break;
+	case XB_CLEAR:
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit >= BITS_PER_LONG)
+				return 0;
+			tmp &= ~(1UL << ebit);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		if (!bitmap)
+			return 0;
+		__clear_bit(bit, bitmap->bitmap);
+		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+			kfree(bitmap);
+			__radix_tree_delete(root, node, slot);
+		}
+		break;
+	case XB_TEST:
+		bitmap = radix_tree_lookup(root, index);
+		if (!bitmap)
+			return 0;
+		if (radix_tree_exception(bitmap)) {
+			if (ebit >= BITS_PER_LONG)
+				return 0;
+			return !!((unsigned long)bitmap & (1UL << ebit));
+		}
+		ret = test_bit(bit, bitmap->bitmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	return xb_bit_ops(xb, bit, XB_SET);
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	xb_bit_ops(xb, bit, XB_CLEAR);
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: 1 if the bit is set, or 0 otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	return (bool)xb_bit_ops(xb, bit, XB_TEST);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
  2017-08-28 10:08 [PATCH v15 0/5] Virtio-balloon Enhancement Wei Wang
  2017-08-28 10:08 ` [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
@ 2017-08-28 10:08 ` Wei Wang
  2017-09-11 13:27   ` Matthew Wilcox
  2017-08-28 10:08 ` [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2017-08-28 10:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

xb_find_next_bit() is used to find the next "1" or "0" bit in the
given range. xb_zero() is used to zero the given range of bits.
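
For illustration, a sketch of how a caller might walk the contiguous runs
of "1" bits recorded in an xb (assuming the xb was populated with
xb_set_bit(), and that start/end are the inclusive bounds of interest):

	unsigned long lo, hi, next = start;

	while (next <= end) {
		lo = xb_find_next_bit(&xb, next, end, 1);	/* next "1" bit */
		if (lo > end)
			break;					/* no more "1" bits */
		hi = xb_find_next_bit(&xb, lo + 1, end, 0);	/* next "0" bit */
		/* bits [lo, hi - 1] are all set: process the run ... */
		xb_zero(&xb, lo, hi - 1);			/* ... then clear it */
		next = hi + 1;
	}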

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/xbitmap.h |  3 +++
 lib/xbitmap.c           | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 25b05ff..0061f7a 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -38,6 +38,9 @@ static inline void xb_init(struct xb *xb)
 int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end);
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set);
 
 /* Check if the xb tree is empty */
 static inline bool xb_is_empty(const struct xb *xb)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 8c55296..b9e2a0c 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -174,3 +174,42 @@ void xb_preload(gfp_t gfp)
 	}
 }
 EXPORT_SYMBOL(xb_preload);
+
+/**
+ *  xb_zero - zero a range of bits in the xbitmap
+ *  @xb: the xbitmap that the bits reside in
+ *  @start: the start of the range, inclusive
+ *  @end: the end of the range, inclusive
+ */
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++)
+		xb_clear_bit(xb, i);
+}
+EXPORT_SYMBOL(xb_zero);
+
+/**
+ * xb_find_next_bit - find the next 1 or 0 in the given range of bits
+ * @xb: the xbitmap that the bits reside in
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ * @set: the polarity (1 or 0) of the next bit to find
+ *
+ * Return the index of the found bit in the xbitmap. If the returned index
+ * exceeds @end, it indicates that no such bit is found in the given range.
+ */
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++) {
+		if (xb_test_bit(xb, i) == set)
+			break;
+	}
+
+	return i;
+}
+EXPORT_SYMBOL(xb_find_next_bit);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-28 10:08 [PATCH v15 0/5] Virtio-balloon Enhancement Wei Wang
  2017-08-28 10:08 ` [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
  2017-08-28 10:08 ` [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
@ 2017-08-28 10:08 ` Wei Wang
  2017-08-28 18:03   ` Michael S. Tsirkin
  2017-08-28 10:08 ` [PATCH v15 4/5] mm: support reporting free page blocks Wei Wang
  2017-08-28 10:08 ` [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ Wei Wang
  4 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2017-08-28 10:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically contiguous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.
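
As a rough sketch of the idea (mirroring the add_one_sg() helper added
below; pfn_start, nr_pages and vq are assumed inputs here), a physically
contiguous run of guest pages is described by a single sg and posted to
the virtqueue in one shot:

	struct scatterlist sg;
	void *addr = (void *)pfn_to_kaddr(pfn_start);
	uint32_t len = nr_pages << PAGE_SHIFT;	/* the whole run in one sg */

	sg_init_one(&sg, addr, len);
	/* error checking elided */
	virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
	virtqueue_kick(vq);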

With this new feature, the above ballooning process takes ~597ms
resulting in an improvement of ~86%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 171 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 155 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..8ecc1d4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +146,111 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, addr, size);
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static void send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size,
+				 bool batch)
+{
+	unsigned int len;
+	int err;
+
+	err = add_one_sg(vq, addr, size);
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+
+	/* If batching is in use, we batch the sgs till the vq is full. */
+	if (!batch || !vq->num_free) {
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		/* Release all the entries, if there are any */
+		while (virtqueue_get_buf(vq, &len))
+			;
+	}
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for contiguous "1" bits, which correspond
+ * to contiguous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+						page_xb_end, 1);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+					      page_xb_end, 0);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, 1);
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		send_balloon_page_sg(vb, vq, sg_addr, sg_len, 1);
+		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq)) {
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
+		while (virtqueue_get_buf(vq, &sg_len))
+			;
+	}
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +279,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +310,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +325,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +340,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +564,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -464,6 +588,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +610,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE, 0);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE, 0);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +686,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +805,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [PATCH v15 4/5] mm: support reporting free page blocks
  2017-08-28 10:08 [PATCH v15 0/5] Virtio-balloon Enhancement Wei Wang
                   ` (2 preceding siblings ...)
  2017-08-28 10:08 ` [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-08-28 10:08 ` Wei Wang
  2017-08-28 13:33   ` Michal Hocko
  2017-08-28 10:08 ` [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ Wei Wang
  4 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2017-08-28 10:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.
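
For illustration, a sketch of what a user of this interface might look
like (struct my_ctx and record_free_range() are hypothetical placeholders;
per the documentation added below, the callback must not sleep or allocate
memory):

	static bool my_report_cb(void *opaque, unsigned long pfn,
				 unsigned long nr_pages)
	{
		struct my_ctx *ctx = opaque;

		/* Record [pfn, pfn + nr_pages); no sleeping or allocation here. */
		record_free_range(ctx, pfn, nr_pages);

		return false;	/* false: keep walking; true: stop the walk */
	}

	/* Called from a context that may sleep: */
	walk_free_mem_block(ctx, 0, my_report_cb);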

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  5 +++++
 mm/page_alloc.c    | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..3c4267d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque,
+				int min_order,
+				bool (*report_page_block)(void *, unsigned long,
+							  unsigned long));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..81eedc7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,71 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_page_block: the callback function to report free page blocks
+ *
+ * If the callback returns 1, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general, low orders tend to be very volatile, so it makes more sense
+ * to query larger (higher-order) blocks first for optimizations such as
+ * ballooning. This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+			 int min_order,
+			 bool (*report_page_block)(void *, unsigned long,
+						   unsigned long))
+{
+	struct zone *zone;
+	struct page *page;
+	struct list_head *list;
+	int order;
+	enum migratetype mt;
+	unsigned long pfn, flags;
+	bool stop = 0;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; !stop && mt < MIGRATE_TYPES; mt++) {
+				spin_lock_irqsave(&zone->lock, flags);
+				list = &zone->free_area[order].free_list[mt];
+				list_for_each_entry(page, list, lru) {
+					pfn = page_to_pfn(page);
+					stop = report_page_block(opaque, pfn,
+								 1 << order);
+					if (stop)
+						break;
+				}
+				spin_unlock_irqrestore(&zone->lock, flags);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-08-28 10:08 [PATCH v15 0/5] Virtio-balloon Enhancement Wei Wang
                   ` (3 preceding siblings ...)
  2017-08-28 10:08 ` [PATCH v15 4/5] mm: support reporting free page blocks Wei Wang
@ 2017-08-28 10:08 ` Wei Wang
  2017-09-05 11:57   ` Wang, Wei W
  4 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2017-08-28 10:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq, ctrl_vq, to handle commands between the host and guest.
With this feature, the control plane and data plane are separated: the
control-related data of each feature is sent via ctrl_vq cmds, while
each feature may have its own data plane vq.

Free page reporting is the first new feature controlled via the ctrl_vq,
and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
Currently, this feature has two cmds:
VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
to start the free page reporting work.
VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is used bidirectionally. The
guest sends the cmd to the host to indicate that the reporting work is
done. The host sends the cmd to the guest to actively request that the
reporting work be stopped.

The free_page_vq is used to transmit the guest free page blocks to the
host.
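
As a rough sketch (vb is the device's struct virtio_balloon; the cmd
buffer lives there so it stays valid until the device consumes it, and
ctrlq_send_cmd() below does the actual posting), the guest-to-host STOP
notification boils down to:

	struct virtio_balloon_ctrlq_cmd *cmd = &vb->free_page_cmd_out;
	struct scatterlist sg;

	cmd->class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
	cmd->cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;

	sg_init_one(&sg, cmd, sizeof(*cmd));
	/* error checking elided */
	virtqueue_add_outbuf(vb->ctrl_vq, &sg, 1, cmd, GFP_KERNEL);
	virtqueue_kick(vb->ctrl_vq);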

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 247 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  15 +++
 2 files changed, 242 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8ecc1d4..1d384a4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
+			 *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The work items submitted to the balloon wq are listed here */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +71,9 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -93,6 +102,11 @@ struct virtio_balloon {
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
+
+	/* Host to guest ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
+	/* Guest to Host ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -177,6 +191,26 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
 	}
 }
 
+static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	unsigned int len;
+	int err = -ENOSPC;
+
+	do {
+		if (vq->num_free) {
+			err = add_one_sg(vq, addr, size);
+			/* Sanity check: this can't really happen */
+			WARN_ON(err);
+			if (!err)
+				virtqueue_kick(vq);
+		}
+
+		/* Release entries, if there are any */
+		while (virtqueue_get_buf(vq, &len))
+			;
+	} while (err == -ENOSPC && vq->num_free);
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -525,42 +559,206 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return 1;
+
+	send_free_page_sg(vb->free_page_vq, addr, len);
+
+	return 0;
+}
+
+static void ctrlq_add_cmd(struct virtqueue *vq,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
+	if (inbuf)
+		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+	else
+		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+}
 
+static void ctrlq_send_cmd(struct virtio_balloon *vb,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
+{
+	struct virtqueue *vq = vb->ctrl_vq;
+
+	ctrlq_add_cmd(vq, cmd, inbuf);
+	if (!inbuf) {
+		/*
+		 * All the input cmd buffers are replenished here.
+		 * This is necessary because the input cmd buffers are lost
+		 * after live migration. The device needs to rewind all of
+		 * them from the ctrl_vq.
+		 */
+		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
+	}
+	virtqueue_kick(vq);
+}
+
+static void report_free_page_end(struct virtio_balloon *vb)
+{
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+
+	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
+	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
+	vb->report_free_page_stop = true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_end(vb);
+}
+
+static void ctrlq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_ctrlq_cmd *cmd;
+	unsigned int len;
+
+	cmd = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
+
+	if (unlikely(!cmd))
+		return;
+
+	/* The outbuf is sent by the host for recycling, so just return. */
+	if (cmd == &vb->free_page_cmd_out)
+		return;
+
+	switch (cmd->class) {
+	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
+		if (cmd->cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
+			vb->report_free_page_stop = true;
+		} else if (cmd->cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
+			vb->report_free_page_stop = false;
+			queue_work(vb->balloon_wq, &vb->report_free_page_work);
+		}
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
+			 __func__);
+	}
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	/* If ctrlq is enabled, the free page vq will also be created */
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
+		nvqs += 2;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		callbacks[i] = ctrlq_handle;
+		names[i++] = "ctrlq";
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->ctrl_vq = vqs[i++];
+		vb->free_page_vq = vqs[i];
+		/* Prime the ctrlq with an inbuf for the host to send a cmd */
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -689,6 +887,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -753,6 +958,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -806,6 +1012,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CTRL_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..dbf0616 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,18 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+enum {
+	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
+	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
+};
+
+struct virtio_balloon_ctrlq_cmd {
+	__virtio32 class;
+	__virtio32 cmd;
+};
+
+/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
+#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
+#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCH v15 4/5] mm: support reporting free page blocks
  2017-08-28 10:08 ` [PATCH v15 4/5] mm: support reporting free page blocks Wei Wang
@ 2017-08-28 13:33   ` Michal Hocko
  2017-08-28 14:09     ` Michal Hocko
  2017-08-29  3:23     ` Wei Wang
  0 siblings, 2 replies; 20+ messages in thread
From: Michal Hocko @ 2017-08-28 13:33 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon 28-08-17 18:08:32, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> One use example of this patch is to accelerate live migration by skipping
> the transfer of free pages reported from the guest. A popular method used
> by the hypervisor to track which part of memory is written during live
> migration is to write-protect all the guest memory. So, those pages that
> are reported as free pages but are written after the report function
> returns will be captured by the hypervisor, and they will be added to the
> next round of memory transfer.

OK, looks much better. I still have a few nits.

> +extern void walk_free_mem_block(void *opaque,
> +				int min_order,
> +				bool (*report_page_block)(void *, unsigned long,
> +							  unsigned long));
> +

please add names to arguments of the prototype

>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..81eedc7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,71 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @report_page_block: the callback function to report free page blocks

page_block has a meaning in the core MM which doesn't strictly match its
usage here. Moreover, we are reporting pfn ranges rather than struct page
ranges. So report_pfn_range would suit better.
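
e.g. something along these lines (a sketch only, combining both nits):

	extern void walk_free_mem_block(void *opaque,
					int min_order,
					bool (*report_pfn_range)(void *opaque,
								 unsigned long pfn,
								 unsigned long nr_pages));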

[...]
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
> +			for (mt = 0; !stop && mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					stop = report_page_block(opaque, pfn,
> +								 1 << order);
> +					if (stop)
> +						break;

					if (stop) {
						spin_unlock_irqrestore(&zone->lock, flags);
						return;
					}

would be both easier and less error prone. E.g. You wouldn't pointlessly
iterate over remaining orders just to realize there is nothing to be
done for those...

> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCH v15 4/5] mm: support reporting free page blocks
  2017-08-28 13:33   ` Michal Hocko
@ 2017-08-28 14:09     ` Michal Hocko
  2017-08-29  3:23     ` Wei Wang
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2017-08-28 14:09 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon 28-08-17 15:33:26, Michal Hocko wrote:
> On Mon 28-08-17 18:08:32, Wei Wang wrote:
> > This patch adds support to walk through the free page blocks in the
> > system and report them via a callback function. Some page blocks may
> > leave the free list after zone->lock is released, so it is the caller's
> > responsibility to either detect or prevent the use of such pages.
> > 
> > One use example of this patch is to accelerate live migration by skipping
> > the transfer of free pages reported from the guest. A popular method used
> > by the hypervisor to track which part of memory is written during live
> > migration is to write-protect all the guest memory. So, those pages that
> > are reported as free pages but are written after the report function
> > returns will be captured by the hypervisor, and they will be added to the
> > next round of memory transfer.
> 
> OK, looks much better. I still have few nits.
> 
> > +extern void walk_free_mem_block(void *opaque,
> > +				int min_order,
> > +				bool (*report_page_block)(void *, unsigned long,
> > +							  unsigned long));
> > +
> 
> please add names to arguments of the prototype

And one more thing. Your callback returns bool, and true usually means
success, while you are using it to break out of the loop. This is rather
confusing. I would expect to iterate until false is returned, i.e. the
opposite of what you have. You could also change this to int and return 0
on success and < 0 to break out.
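
Roughly, either of these would do (sketches only, shown at the call site):

	/* bool variant: the callback returns true to continue, false to stop */
	if (!report_pfn_range(opaque, pfn, 1 << order)) {
		spin_unlock_irqrestore(&zone->lock, flags);
		return;
	}

	/* int variant: 0 to continue, a negative errno to stop the walk */
	if (report_pfn_range(opaque, pfn, 1 << order) < 0) {
		spin_unlock_irqrestore(&zone->lock, flags);
		return;
	}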

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-28 10:08 ` [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-08-28 18:03   ` Michael S. Tsirkin
  2017-08-29  3:09     ` Wei Wang
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2017-08-28 18:03 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon, Aug 28, 2017 at 06:08:31PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~597ms
> resulting in an improvement of ~86%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 171 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 155 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..8ecc1d4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,8 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
> +#include <asm/page.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +81,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record balloon pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +146,111 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, addr, size);
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size,
> +				 bool batch)
> +{
> +	unsigned int len;
> +	int err;
> +
> +	err = add_one_sg(vq, addr, size);
> +	/* Sanity check: this can't really happen */
> +	WARN_ON(err);

It might be cleaner to detect that the add failed because the ring is
full and kick then. Just an idea, up to you whether to do it.

> +
> +	/* If batching is in use, we batch the sgs till the vq is full. */
> +	if (!batch || !vq->num_free) {
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/* Release all the entries if there are */

Meaning
	Account for all used entries if any
?

> +		while (virtqueue_get_buf(vq, &len))
> +			;


Above code is reused below. Add a function?

> +	}
> +}
> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> +						page_xb_end, 1);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +					      page_xb_end, 0);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, 1);

Last argument should be true, not 1.

> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len, 1);
> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +
> +	/*
> +	 * The last few sgs may not reach the batch size, but need a kick to
> +	 * notify the device to handle them.
> +	 */
> +	if (vq->num_free != virtqueue_get_vring_size(vq)) {
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
> +		while (virtqueue_get_buf(vq, &sg_len))
> +			;

Some entries can get used after a pause. Looks like they will leak then?
One fix would be to convert above if to a while loop.
I don't know whether to do it like this in send_balloon_page_sg too.
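
One way to apply that here (a sketch of the suggested change, not tested
code):

	/* keep kicking and draining until all in-flight entries are back */
	while (vq->num_free != virtqueue_get_vring_size(vq)) {
		virtqueue_kick(vq);
		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
		while (virtqueue_get_buf(vq, &sg_len))
			;
	}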

> +	}
> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +279,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +310,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +325,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +340,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +564,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +588,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +610,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE, 0);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE, 0);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +686,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +805,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-28 18:03   ` Michael S. Tsirkin
@ 2017-08-29  3:09     ` Wei Wang
  2017-09-08  3:36       ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2017-08-29  3:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/29/2017 02:03 AM, Michael S. Tsirkin wrote:
> On Mon, Aug 28, 2017 at 06:08:31PM +0800, Wei Wang wrote:
>> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
>> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
>> to the host.
>>
>> The implementation of the previous virtio-balloon is not very
>> efficient, because the balloon pages are transferred to the
>> host one by one. Here is the breakdown of the time in percentage
>> spent on each step of the balloon inflating process (inflating
>> 7GB of an 8GB idle guest).
>>
>> 1) allocating pages (6.5%)
>> 2) sending PFNs to host (68.3%)
>> 3) address translation (6.1%)
>> 4) madvise (19%)
>>
>> It takes about 4126ms for the inflating process to complete.
>> The above profiling shows that the bottlenecks are stage 2)
>> and stage 4).
>>
>> This patch optimizes step 2) by transferring pages to the host in
>> sgs. An sg describes a chunk of guest physically continuous pages.
>> With this mechanism, step 4) can also be optimized by doing address
>> translation and madvise() in chunks rather than page by page.
>>
>> With this new feature, the above ballooning process takes ~597ms
>> resulting in an improvement of ~86%.
>>
>> TODO: optimize stage 1) by allocating/freeing a chunk of pages
>> instead of a single page each time.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>   drivers/virtio/virtio_balloon.c     | 171 ++++++++++++++++++++++++++++++++----
>>   include/uapi/linux/virtio_balloon.h |   1 +
>>   2 files changed, 155 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index f0b3a0b..8ecc1d4 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -32,6 +32,8 @@
>>   #include <linux/mm.h>
>>   #include <linux/mount.h>
>>   #include <linux/magic.h>
>> +#include <linux/xbitmap.h>
>> +#include <asm/page.h>
>>   
>>   /*
>>    * Balloon device works in 4K page units.  So each page is pointed to by
>> @@ -79,6 +81,9 @@ struct virtio_balloon {
>>   	/* Synchronize access/update to this struct virtio_balloon elements */
>>   	struct mutex balloon_lock;
>>   
>> +	/* The xbitmap used to record balloon pages */
>> +	struct xb page_xb;
>> +
>>   	/* The array of pfns we tell the Host about. */
>>   	unsigned int num_pfns;
>>   	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
>> @@ -141,13 +146,111 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>   					  page_to_balloon_pfn(page) + i);
>>   }
>>   
>> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
>> +{
>> +	struct scatterlist sg;
>> +
>> +	sg_init_one(&sg, addr, size);
>> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
>> +}
>> +
>> +static void send_balloon_page_sg(struct virtio_balloon *vb,
>> +				 struct virtqueue *vq,
>> +				 void *addr,
>> +				 uint32_t size,
>> +				 bool batch)
>> +{
>> +	unsigned int len;
>> +	int err;
>> +
>> +	err = add_one_sg(vq, addr, size);
>> +	/* Sanity check: this can't really happen */
>> +	WARN_ON(err);
> It might be cleaner to detect that add failed due to
> ring full and kick then. Just an idea, up to you
> whether to do it.
>
>> +
>> +	/* If batching is in use, we batch the sgs till the vq is full. */
>> +	if (!batch || !vq->num_free) {
>> +		virtqueue_kick(vq);
>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		/* Release all the entries if there are */
> Meaning
> 	Account for all used entries if any
> ?
>
>> +		while (virtqueue_get_buf(vq, &len))
>> +			;
>
> Above code is reused below. Add a function?
>
>> +	}
>> +}
>> +
>> +/*
>> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
>> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>> + * The page xbitmap is searched for continuous "1" bits, which correspond
>> + * to continuous pages, to chunk into sgs.
>> + *
>> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
>> + * need to be searched.
>> + */
>> +static void tell_host_sgs(struct virtio_balloon *vb,
>> +			  struct virtqueue *vq,
>> +			  unsigned long page_xb_start,
>> +			  unsigned long page_xb_end)
>> +{
>> +	unsigned long sg_pfn_start, sg_pfn_end;
>> +	void *sg_addr;
>> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
>> +
>> +	sg_pfn_start = page_xb_start;
>> +	while (sg_pfn_start < page_xb_end) {
>> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
>> +						page_xb_end, 1);
>> +		if (sg_pfn_start == page_xb_end + 1)
>> +			break;
>> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
>> +					      page_xb_end, 0);
>> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
>> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
>> +		while (sg_len > sg_max_len) {
>> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, 1);
> Last argument should be true, not 1.
>
>> +			sg_addr += sg_max_len;
>> +			sg_len -= sg_max_len;
>> +		}
>> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len, 1);
>> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
>> +		sg_pfn_start = sg_pfn_end + 1;
>> +	}
>> +
>> +	/*
>> +	 * The last few sgs may not reach the batch size, but need a kick to
>> +	 * notify the device to handle them.
>> +	 */
>> +	if (vq->num_free != virtqueue_get_vring_size(vq)) {
>> +		virtqueue_kick(vq);
>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
>> +		while (virtqueue_get_buf(vq, &sg_len))
>> +			;
> Some entries can get used after a pause. Looks like they will leak then?
> One fix would be to convert above if to a while loop.
> I don't know whether to do it like this in send_balloon_page_sg too.
>

Thanks for the above comments. I've rewritten this part of the code.
Please check below whether there is anything more we could improve:

static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
{
         unsigned int len;

         virtqueue_kick(vq);
         wait_event(wq_head, virtqueue_get_buf(vq, &len));
         /* Detach all the used buffers from the vq */
         while (virtqueue_get_buf(vq, &len))
                 ;
}

static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
{
         struct scatterlist sg;
         int ret;

         sg_init_one(&sg, addr, size);
         ret = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
         if (unlikely(ret == -ENOSPC))
                 dev_warn(&vq->vdev->dev, "%s: failed due to ring full\n",
                                  __func__);

         return ret;
}

static void send_balloon_page_sg(struct virtio_balloon *vb,
                                 struct virtqueue *vq,
                                 void *addr,
                                 uint32_t size,
                                 bool batch)
{
         int err;

         do {
                 err = add_one_sg(vq, addr, size);
                 if (err == -ENOSPC || !batch || !vq->num_free)
                         kick_and_wait(vq, vb->acked);
         } while (err == -ENOSPC);
}
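
For reference, the trailing flush at the end of tell_host_sgs could then
presumably collapse to something like this (a sketch assuming the
kick_and_wait() helper above):

        /* flush the remaining sgs that did not fill up a full batch */
        while (vq->num_free != virtqueue_get_vring_size(vq))
                kick_and_wait(vq, vb->acked);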


Best,
Wei

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v15 4/5] mm: support reporting free page blocks
  2017-08-28 13:33   ` Michal Hocko
  2017-08-28 14:09     ` Michal Hocko
@ 2017-08-29  3:23     ` Wei Wang
  1 sibling, 0 replies; 20+ messages in thread
From: Wei Wang @ 2017-08-29  3:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/28/2017 09:33 PM, Michal Hocko wrote:
> On Mon 28-08-17 18:08:32, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
>>
>> One use example of this patch is to accelerate live migration by skipping
>> the transfer of free pages reported from the guest. A popular method used
>> by the hypervisor to track which part of memory is written during live
>> migration is to write-protect all the guest memory. So, those pages that
>> are reported as free pages but are written after the report function
>> returns will be captured by the hypervisor, and they will be added to the
>> next round of memory transfer.
> OK, looks much better. I still have few nits.
>
>> +extern void walk_free_mem_block(void *opaque,
>> +				int min_order,
>> +				bool (*report_page_block)(void *, unsigned long,
>> +							  unsigned long));
>> +
> please add names to arguments of the prototype
>
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..81eedc7 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,71 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @report_page_block: the callback function to report free page blocks
> page_block has meaning in the core MM which doesn't strictly match its
> usage here. Moreover we are reporting pfn ranges rather than struct page
> range. So report_pfn_range would suit better.
>
> [...]
>> +	for_each_populated_zone(zone) {
>> +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
>> +			for (mt = 0; !stop && mt < MIGRATE_TYPES; mt++) {
>> +				spin_lock_irqsave(&zone->lock, flags);
>> +				list = &zone->free_area[order].free_list[mt];
>> +				list_for_each_entry(page, list, lru) {
>> +					pfn = page_to_pfn(page);
>> +					stop = report_page_block(opaque, pfn,
>> +								 1 << order);
>> +					if (stop)
>> +						break;
> 					if (stop) {
> 						spin_unlock_irqrestore(&zone->lock, flags);
> 						return;
> 					}
>
> would be both easier and less error prone. E.g. You wouldn't pointlessly
> iterate over remaining orders just to realize there is nothing to be
> done for those...
>

Yes, that's better, thanks. I will take the other suggestions as well.

Best,
Wei



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-08-28 10:08 ` [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ Wei Wang
@ 2017-09-05 11:57   ` Wang, Wei W
  0 siblings, 0 replies; 20+ messages in thread
From: Wang, Wei W @ 2017-09-05 11:57 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

Ping for comments if possible. Thanks.

On Monday, August 28, 2017 6:09 PM, Wang, Wei W wrote:
> [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
> 
> Add a new vq, ctrl_vq, to handle commands between the host and guest.
> With this feature, we will be able to have the control plane and data plane
> separated. In other words, the control related data of each feature will be sent
> via the ctrl_vq cmds, meanwhile each feature may have its own data plane vq.
> 
> Free page report is the first new feature controlled via ctrl_vq, and a new
> cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
> Currently, this feature has two cmds:
> VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest to
> start the free page reporting work.
> VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is used bidirectionally. The
> guest would send the cmd to the host to indicate the reporting work is done.
> The host would send the cmd to the guest to actively request the stop of the
> reporting work.
> 
> The free_page_vq is used to transmit the guest free page blocks to the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 247 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |  15 +++
>  2 files changed, 242 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8ecc1d4..1d384a4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
> 
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
> +			 *free_page_vq;
> +
> +	/* Balloon's own wq for cpu-intensive work items */
> +	struct workqueue_struct *balloon_wq;
> +	/* The work items submitted to the balloon wq are listed here */
> +	struct work_struct report_free_page_work;
> 
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -65,6 +71,9 @@ struct virtio_balloon {
>  	spinlock_t stop_update_lock;
>  	bool stop_update;
> 
> +	/* Stop reporting free pages */
> +	bool report_free_page_stop;
> +
>  	/* Waiting for host to ack the pages we released. */
>  	wait_queue_head_t acked;
> 
> @@ -93,6 +102,11 @@ struct virtio_balloon {
> 
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
> +
> +	/* Host to guest ctrlq cmd buf for free page report */
> +	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
> +	/* Guest to Host ctrlq cmd buf for free page report */
> +	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
>  };
> 
>  static struct virtio_device_id id_table[] = {
> @@ -177,6 +191,26 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	}
>  }
> 
> +static void send_free_page_sg(struct virtqueue *vq, void *addr,
> +			      uint32_t size)
> +{
> +	unsigned int len;
> +	int err = -ENOSPC;
> +
> +	do {
> +		if (vq->num_free) {
> +			err = add_one_sg(vq, addr, size);
> +			/* Sanity check: this can't really happen */
> +			WARN_ON(err);
> +			if (!err)
> +				virtqueue_kick(vq);
> +		}
> +
> +		/* Release entries if there are */
> +		while (virtqueue_get_buf(vq, &len))
> +			;
> +	} while (err == -ENOSPC && vq->num_free);
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -525,42 +559,206 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
> 
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	if (vb->report_free_page_stop)
> +		return 1;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +
> +	return 0;
> +}
> +
> +static void ctrlq_add_cmd(struct virtqueue *vq,
> +			  struct virtio_balloon_ctrlq_cmd *cmd,
> +			  bool inbuf)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct scatterlist sg;
> +	int err;
> +
> +	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
> +	if (inbuf)
> +		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
> +	else
> +		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
> +
> +	/* Sanity check: this can't really happen */
> +	WARN_ON(err);
> +}
> 
> +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> +			  struct virtio_balloon_ctrlq_cmd *cmd,
> +			  bool inbuf)
> +{
> +	struct virtqueue *vq = vb->ctrl_vq;
> +
> +	ctrlq_add_cmd(vq, cmd, inbuf);
> +	if (!inbuf) {
> +		/*
> +		 * All the input cmd buffers are replenished here.
> +		 * This is necessary because the input cmd buffers are lost
> +		 * after live migration. The device needs to rewind all of
> +		 * them from the ctrl_vq.
> +		 */
> +		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
> +	}
> +	virtqueue_kick(vq);
> +}
> +
> +static void report_free_page_end(struct virtio_balloon *vb)
> +{
>  	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * The host may have already requested to stop the reporting before we
> +	 * finish, so no need to notify the host in this case.
>  	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (vb->report_free_page_stop)
> +		return;
> +
> +	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
> +	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
> +	vb->report_free_page_stop = true;
> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_end(vb);
> +}
> +
> +static void ctrlq_handle(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +	struct virtio_balloon_ctrlq_cmd *cmd;
> +	unsigned int len;
> +
> +	cmd = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> +
> +	if (unlikely(!cmd))
> +		return;
> +
> +	/* The outbuf is sent by the host for recycling, so just return. */
> +	if (cmd == &vb->free_page_cmd_out)
> +		return;
> +
> +	switch (cmd->class) {
> +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> +		if (cmd->cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> +			vb->report_free_page_stop = true;
> +		} else if (cmd->cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> +			vb->report_free_page_stop = false;
> +			queue_work(vb->balloon_wq,
> +				   &vb->report_free_page_work);
> +		}
> +		vb->free_page_cmd_in.class =
> +				VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> +	break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> +			 __func__);
> +	}
> +}
> +
> +static int init_vqs(struct virtio_balloon *vb)
> +{
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	/* If ctrlq is enabled, the free page vq will also be created */
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
> +		nvqs += 2;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		callbacks[i] = ctrlq_handle;
> +		names[i++] = "ctrlq";
> +		callbacks[i] = NULL;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
> 
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		vb->ctrl_vq = vqs[i++];
> +		vb->free_page_vq = vqs[i];
> +		/* Prime the ctrlq with an inbuf for the host to send a cmd */
> +		vb->free_page_cmd_in.class =
> +				VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
> 
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -689,6 +887,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
> 
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		vb->balloon_wq = alloc_workqueue("balloon-wq",
> +					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +		vb->report_free_page_stop = true;
> +	}
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -753,6 +958,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
> 
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -806,6 +1012,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_CTRL_VQ,
>  };
> 
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..dbf0616 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
> 
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -83,4 +84,18 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
> 
> +enum {
> +	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
> +	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
> +};
> +
> +struct virtio_balloon_ctrlq_cmd {
> +	__virtio32 class;
> +	__virtio32 cmd;
> +};
> +
> +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
> +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> +
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-29  3:09     ` Wei Wang
@ 2017-09-08  3:36       ` Michael S. Tsirkin
  2017-09-08 11:09         ` [virtio-dev] " Wei Wang
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2017-09-08  3:36 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Tue, Aug 29, 2017 at 11:09:18AM +0800, Wei Wang wrote:
> On 08/29/2017 02:03 AM, Michael S. Tsirkin wrote:
> > On Mon, Aug 28, 2017 at 06:08:31PM +0800, Wei Wang wrote:
> > > Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> > > of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> > > to the host.
> > > 
> > > The implementation of the previous virtio-balloon is not very
> > > efficient, because the balloon pages are transferred to the
> > > host one by one. Here is the breakdown of the time in percentage
> > > spent on each step of the balloon inflating process (inflating
> > > 7GB of an 8GB idle guest).
> > > 
> > > 1) allocating pages (6.5%)
> > > 2) sending PFNs to host (68.3%)
> > > 3) address translation (6.1%)
> > > 4) madvise (19%)
> > > 
> > > It takes about 4126ms for the inflating process to complete.
> > > The above profiling shows that the bottlenecks are stage 2)
> > > and stage 4).
> > > 
> > > This patch optimizes step 2) by transferring pages to the host in
> > > sgs. An sg describes a chunk of guest physically continuous pages.
> > > With this mechanism, step 4) can also be optimized by doing address
> > > translation and madvise() in chunks rather than page by page.
> > > 
> > > With this new feature, the above ballooning process takes ~597ms
> > > resulting in an improvement of ~86%.
> > > 
> > > TODO: optimize stage 1) by allocating/freeing a chunk of pages
> > > instead of a single page each time.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c     | 171 ++++++++++++++++++++++++++++++++----
> > >   include/uapi/linux/virtio_balloon.h |   1 +
> > >   2 files changed, 155 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index f0b3a0b..8ecc1d4 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -32,6 +32,8 @@
> > >   #include <linux/mm.h>
> > >   #include <linux/mount.h>
> > >   #include <linux/magic.h>
> > > +#include <linux/xbitmap.h>
> > > +#include <asm/page.h>
> > >   /*
> > >    * Balloon device works in 4K page units.  So each page is pointed to by
> > > @@ -79,6 +81,9 @@ struct virtio_balloon {
> > >   	/* Synchronize access/update to this struct virtio_balloon elements */
> > >   	struct mutex balloon_lock;
> > > +	/* The xbitmap used to record balloon pages */
> > > +	struct xb page_xb;
> > > +
> > >   	/* The array of pfns we tell the Host about. */
> > >   	unsigned int num_pfns;
> > >   	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > @@ -141,13 +146,111 @@ static void set_page_pfns(struct virtio_balloon *vb,
> > >   					  page_to_balloon_pfn(page) + i);
> > >   }
> > > +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > +{
> > > +	struct scatterlist sg;
> > > +
> > > +	sg_init_one(&sg, addr, size);
> > > +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > > +}
> > > +
> > > +static void send_balloon_page_sg(struct virtio_balloon *vb,
> > > +				 struct virtqueue *vq,
> > > +				 void *addr,
> > > +				 uint32_t size,
> > > +				 bool batch)
> > > +{
> > > +	unsigned int len;
> > > +	int err;
> > > +
> > > +	err = add_one_sg(vq, addr, size);
> > > +	/* Sanity check: this can't really happen */
> > > +	WARN_ON(err);
> > It might be cleaner to detect that add failed due to
> > ring full and kick then. Just an idea, up to you
> > whether to do it.
> > 
> > > +
> > > +	/* If batching is in use, we batch the sgs till the vq is full. */
> > > +	if (!batch || !vq->num_free) {
> > > +		virtqueue_kick(vq);
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		/* Release all the entries if there are */
> > Meaning
> > 	Account for all used entries if any
> > ?
> > 
> > > +		while (virtqueue_get_buf(vq, &len))
> > > +			;
> > 
> > Above code is reused below. Add a function?
> > 
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> > > + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > > + * The page xbitmap is searched for continuous "1" bits, which correspond
> > > + * to continuous pages, to chunk into sgs.
> > > + *
> > > + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> > > + * need to be searched.
> > > + */
> > > +static void tell_host_sgs(struct virtio_balloon *vb,
> > > +			  struct virtqueue *vq,
> > > +			  unsigned long page_xb_start,
> > > +			  unsigned long page_xb_end)
> > > +{
> > > +	unsigned long sg_pfn_start, sg_pfn_end;
> > > +	void *sg_addr;
> > > +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> > > +
> > > +	sg_pfn_start = page_xb_start;
> > > +	while (sg_pfn_start < page_xb_end) {
> > > +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> > > +						page_xb_end, 1);
> > > +		if (sg_pfn_start == page_xb_end + 1)
> > > +			break;
> > > +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> > > +					      page_xb_end, 0);
> > > +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> > > +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> > > +		while (sg_len > sg_max_len) {
> > > +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, 1);
> > Last argument should be true, not 1.
> > 
> > > +			sg_addr += sg_max_len;
> > > +			sg_len -= sg_max_len;
> > > +		}
> > > +		send_balloon_page_sg(vb, vq, sg_addr, sg_len, 1);
> > > +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> > > +		sg_pfn_start = sg_pfn_end + 1;
> > > +	}
> > > +
> > > +	/*
> > > +	 * The last few sgs may not reach the batch size, but need a kick to
> > > +	 * notify the device to handle them.
> > > +	 */
> > > +	if (vq->num_free != virtqueue_get_vring_size(vq)) {
> > > +		virtqueue_kick(vq);
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
> > > +		while (virtqueue_get_buf(vq, &sg_len))
> > > +			;
> > Some entries can get used after a pause. Looks like they will leak then?
> > One fix would be to convert above if to a while loop.
> > I don't know whether to do it like this in send_balloon_page_sg too.
> > 
> 
> Thanks for the above comments. I've re-written this part of code.
> Please have a check below if there is anything more we could improve:
> 
> static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> {
>         unsigned int len;
> 
>         virtqueue_kick(vq);
>         wait_event(wq_head, virtqueue_get_buf(vq, &len));
>         /* Detach all the used buffers from the vq */
>         while (virtqueue_get_buf(vq, &len))
>                 ;

I would move this last part to before add_buf. Increases chances
it succeeds even in case of a bug.
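
I.e. roughly like this (just a sketch - whether the drain lives in
add_one_sg or in its caller is a detail):

static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
{
	struct scatterlist sg;
	unsigned int len;

	/* detach buffers the host has already used before adding new ones */
	while (virtqueue_get_buf(vq, &len))
		;

	sg_init_one(&sg, addr, size);
	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
}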

> }
> 
> static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> {
>         struct scatterlist sg;
>         int ret;
> 
>         sg_init_one(&sg, addr, size);
>         ret = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
>         if (unlikely(ret == -ENOSPC))
>                 dev_warn(&vq->vdev->dev, "%s: failed due to ring full\n",
>                                  __func__);

So if this ever triggers then kick and wait might fail, right?
I think you should not special-case this one then.

> 
>         return ret;
> }
> 
> static void send_balloon_page_sg(struct virtio_balloon *vb,
>                                                       struct virtqueue *vq,
>                                                       void *addr,
>                                                       uint32_t size,
>                                                       bool batch)
> {
>         int err;
> 
>         do {
>                 err = add_one_sg(vq, addr, size);
>                 if (err == -ENOSPC || !batch || !vq->num_free)
>                         kick_and_wait(vq, vb->acked);
>         } while (err == -ENOSPC);
> }
> 
> 
> Best,
> Wei

I think this fixes the bug, yes. I would skip trying to handle ENOSPC,
it's more or less a bug if it triggers. Just bail out on any error.
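
The send path could then presumably become as simple as this (again just a
sketch):

static void send_balloon_page_sg(struct virtio_balloon *vb,
				 struct virtqueue *vq,
				 void *addr,
				 uint32_t size,
				 bool batch)
{
	int err;

	err = add_one_sg(vq, addr, size);
	/* any failure here is effectively a bug - do not try to recover */
	if (WARN_ON(err))
		return;

	if (!batch || !vq->num_free)
		kick_and_wait(vq, vb->acked);
}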

-- 
MST

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-dev] Re: [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-08  3:36       ` Michael S. Tsirkin
@ 2017-09-08 11:09         ` Wei Wang
  2017-09-29  4:01           ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2017-09-08 11:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 09/08/2017 11:36 AM, Michael S. Tsirkin wrote:
> On Tue, Aug 29, 2017 at 11:09:18AM +0800, Wei Wang wrote:
>> On 08/29/2017 02:03 AM, Michael S. Tsirkin wrote:
>>> On Mon, Aug 28, 2017 at 06:08:31PM +0800, Wei Wang wrote:
>>>> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
>>>> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
>>>> to the host.
>>>>
>>>> The implementation of the previous virtio-balloon is not very
>>>> efficient, because the balloon pages are transferred to the
>>>> host one by one. Here is the breakdown of the time in percentage
>>>> spent on each step of the balloon inflating process (inflating
>>>> 7GB of an 8GB idle guest).
>>>>
>>>> 1) allocating pages (6.5%)
>>>> 2) sending PFNs to host (68.3%)
>>>> 3) address translation (6.1%)
>>>> 4) madvise (19%)
>>>>
>>>> It takes about 4126ms for the inflating process to complete.
>>>> The above profiling shows that the bottlenecks are stage 2)
>>>> and stage 4).
>>>>
>>>> This patch optimizes step 2) by transferring pages to the host in
>>>> sgs. An sg describes a chunk of guest physically continuous pages.
>>>> With this mechanism, step 4) can also be optimized by doing address
>>>> translation and madvise() in chunks rather than page by page.
>>>>
>>>> With this new feature, the above ballooning process takes ~597ms
>>>> resulting in an improvement of ~86%.
>>>>
>>>> TODO: optimize stage 1) by allocating/freeing a chunk of pages
>>>> instead of a single page each time.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c     | 171 ++++++++++++++++++++++++++++++++----
>>>>    include/uapi/linux/virtio_balloon.h |   1 +
>>>>    2 files changed, 155 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index f0b3a0b..8ecc1d4 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -32,6 +32,8 @@
>>>>    #include <linux/mm.h>
>>>>    #include <linux/mount.h>
>>>>    #include <linux/magic.h>
>>>> +#include <linux/xbitmap.h>
>>>> +#include <asm/page.h>
>>>>    /*
>>>>     * Balloon device works in 4K page units.  So each page is pointed to by
>>>> @@ -79,6 +81,9 @@ struct virtio_balloon {
>>>>    	/* Synchronize access/update to this struct virtio_balloon elements */
>>>>    	struct mutex balloon_lock;
>>>> +	/* The xbitmap used to record balloon pages */
>>>> +	struct xb page_xb;
>>>> +
>>>>    	/* The array of pfns we tell the Host about. */
>>>>    	unsigned int num_pfns;
>>>>    	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
>>>> @@ -141,13 +146,111 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>>>    					  page_to_balloon_pfn(page) + i);
>>>>    }
>>>> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> +{
>>>> +	struct scatterlist sg;
>>>> +
>>>> +	sg_init_one(&sg, addr, size);
>>>> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
>>>> +}
>>>> +
>>>> +static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>> +				 struct virtqueue *vq,
>>>> +				 void *addr,
>>>> +				 uint32_t size,
>>>> +				 bool batch)
>>>> +{
>>>> +	unsigned int len;
>>>> +	int err;
>>>> +
>>>> +	err = add_one_sg(vq, addr, size);
>>>> +	/* Sanity check: this can't really happen */
>>>> +	WARN_ON(err);
>>> It might be cleaner to detect that add failed due to
>>> ring full and kick then. Just an idea, up to you
>>> whether to do it.
>>>
>>>> +
>>>> +	/* If batching is in use, we batch the sgs till the vq is full. */
>>>> +	if (!batch || !vq->num_free) {
>>>> +		virtqueue_kick(vq);
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		/* Release all the entries if there are */
>>> Meaning
>>> 	Account for all used entries if any
>>> ?
>>>
>>>> +		while (virtqueue_get_buf(vq, &len))
>>>> +			;
>>> Above code is reused below. Add a function?
>>>
>>>> +	}
>>>> +}
>>>> +
>>>> +/*
>>>> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>> + * The page xbitmap is searched for continuous "1" bits, which correspond
>>>> + * to continuous pages, to chunk into sgs.
>>>> + *
>>>> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
>>>> + * need to be searched.
>>>> + */
>>>> +static void tell_host_sgs(struct virtio_balloon *vb,
>>>> +			  struct virtqueue *vq,
>>>> +			  unsigned long page_xb_start,
>>>> +			  unsigned long page_xb_end)
>>>> +{
>>>> +	unsigned long sg_pfn_start, sg_pfn_end;
>>>> +	void *sg_addr;
>>>> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
>>>> +
>>>> +	sg_pfn_start = page_xb_start;
>>>> +	while (sg_pfn_start < page_xb_end) {
>>>> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
>>>> +						page_xb_end, 1);
>>>> +		if (sg_pfn_start == page_xb_end + 1)
>>>> +			break;
>>>> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
>>>> +					      page_xb_end, 0);
>>>> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
>>>> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
>>>> +		while (sg_len > sg_max_len) {
>>>> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, 1);
>>> Last argument should be true, not 1.
>>>
>>>> +			sg_addr += sg_max_len;
>>>> +			sg_len -= sg_max_len;
>>>> +		}
>>>> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len, 1);
>>>> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
>>>> +		sg_pfn_start = sg_pfn_end + 1;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * The last few sgs may not reach the batch size, but need a kick to
>>>> +	 * notify the device to handle them.
>>>> +	 */
>>>> +	if (vq->num_free != virtqueue_get_vring_size(vq)) {
>>>> +		virtqueue_kick(vq);
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
>>>> +		while (virtqueue_get_buf(vq, &sg_len))
>>>> +			;
>>> Some entries can get used after a pause. Looks like they will leak then?
>>> One fix would be to convert above if to a while loop.
>>> I don't know whether to do it like this in send_balloon_page_sg too.
>>>
>> Thanks for the above comments. I've re-written this part of code.
>> Please have a check below if there is anything more we could improve:
>>
>> static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
>> {
>>          unsigned int len;
>>
>>          virtqueue_kick(vq);
>>          wait_event(wq_head, virtqueue_get_buf(vq, &len));
>>          /* Detach all the used buffers from the vq */
>>          while (virtqueue_get_buf(vq, &len))
>>                  ;
> I would move this last part to before add_buf. Increases chances
> it succeeds even in case of a bug.

>
>> }
>>
>> static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
>> {
>>          struct scatterlist sg;
>>          int ret;
>>
>>          sg_init_one(&sg, addr, size);
>>          ret = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
>>          if (unlikely(ret == -ENOSPC))
>>                  dev_warn(&vq->vdev->dev, "%s: failed due to ring full\n",
>>                                   __func__);
> So if this ever triggers then kick and wait might fail, right?
> I think you should not special-case this one then.

OK, I will remove the check above, and take other suggestions as well. 
Thanks.

Best,
Wei

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap
  2017-08-28 10:08 ` [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
@ 2017-09-11 12:54   ` Matthew Wilcox
  2017-09-12 13:23     ` Wei Wang
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2017-09-11 12:54 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon, Aug 28, 2017 at 06:08:29PM +0800, Wei Wang wrote:
> From: Matthew Wilcox <mawilcox@microsoft.com>
> 
> The eXtensible Bitmap is a sparse bitmap representation which is
> efficient for set bits which tend to cluster.  It supports up to
> 'unsigned long' worth of bits, and this commit adds the bare bones --
> xb_set_bit(), xb_clear_bit() and xb_test_bit().
> 
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
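
For orientation, a minimal sketch of how the three bare-bones calls named in
the changelog above are meant to be exercised. The helper name below is made
up for illustration, and it assumes the caller has already initialised the xb
and done whatever preloading the implementation requires:

/* Sketch only: set, test and clear one bit per guest pfn. */
static int xb_smoke_test(struct xb *xb, unsigned long pfn)
{
	int err;

	err = xb_set_bit(xb, pfn);	/* non-zero (e.g. -EAGAIN) on failure */
	if (err)
		return err;
	WARN_ON(!xb_test_bit(xb, pfn));
	xb_clear_bit(xb, pfn);
	WARN_ON(xb_test_bit(xb, pfn));
	return 0;
}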

This is quite naughty of you.  You've modified the xbitmap implementation
without any indication in the changelog that you did so.  I don't
think the modifications you made are an improvement, but without any
argumentation from you I don't know why you think they're an improvement.

> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 898e879..ee72e2c 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
>  out:
>  	return ret;
>  }
> +EXPORT_SYMBOL(__radix_tree_preload);
>  
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to

You exported this to modules for some reason.  Why?

> @@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
>  	replace_slot(slot, NULL, node, -1, exceptional);
>  	return node && delete_node(root, node, NULL, NULL);
>  }
> +EXPORT_SYMBOL(__radix_tree_delete);
>  
>  /**
>   * radix_tree_iter_delete - delete the entry at this iterator position

Ditto?

> diff --git a/lib/xbitmap.c b/lib/xbitmap.c
> new file mode 100644
> index 0000000..8c55296
> --- /dev/null
> +++ b/lib/xbitmap.c
> @@ -0,0 +1,176 @@
> +#include <linux/slab.h>
> +#include <linux/xbitmap.h>
> +
> +/*
> + * The xbitmap implementation supports up to ULONG_MAX bits, and it is
> + * implemented based on ida bitmaps. So, given an unsigned long index,
> + * the high order XB_INDEX_BITS bits of the index is used to find the
> + * corresponding item (i.e. ida bitmap) from the radix tree, and the low
> + * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
> + * the ida bitmap to find the bit.
> + */
> +#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
> +#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
> +					      RADIX_TREE_MAP_SHIFT))
> +#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)

I don't understand why you moved the xb_preload code here from the
radix tree.  I want all the code which touches the preload implementation
together in one place, which is the radix tree.

> +enum xb_ops {
> +	XB_SET,
> +	XB_CLEAR,
> +	XB_TEST
> +};
> +
> +static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
> +{
> +	int ret = 0;
> +	unsigned long index = bit / IDA_BITMAP_BITS;
> +	struct radix_tree_root *root = &xb->xbrt;
> +	struct radix_tree_node *node;
> +	void **slot;
> +	struct ida_bitmap *bitmap;
> +	unsigned long ebit, tmp;
> +
> +	bit %= IDA_BITMAP_BITS;
> +	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
> +
> +	switch (ops) {
> +	case XB_SET:
> +		ret = __radix_tree_create(root, index, 0, &node, &slot);
> +		if (ret)
> +			return ret;
> +		bitmap = rcu_dereference_raw(*slot);
> +		if (radix_tree_exception(bitmap)) {
> +			tmp = (unsigned long)bitmap;
> +			if (ebit < BITS_PER_LONG) {
> +				tmp |= 1UL << ebit;
> +				rcu_assign_pointer(*slot, (void *)tmp);
> +				return 0;
> +			}
> +			bitmap = this_cpu_xchg(ida_bitmap, NULL);
> +			if (!bitmap)
> +				return -EAGAIN;
> +			memset(bitmap, 0, sizeof(*bitmap));
> +			bitmap->bitmap[0] =
> +					tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
> +			rcu_assign_pointer(*slot, bitmap);
> +		}
> +		if (!bitmap) {
> +			if (ebit < BITS_PER_LONG) {
> +				bitmap = (void *)((1UL << ebit) |
> +					RADIX_TREE_EXCEPTIONAL_ENTRY);
> +				__radix_tree_replace(root, node, slot, bitmap,
> +						     NULL, NULL);
> +				return 0;
> +			}
> +			bitmap = this_cpu_xchg(ida_bitmap, NULL);
> +			if (!bitmap)
> +				return -EAGAIN;
> +			memset(bitmap, 0, sizeof(*bitmap));
> +			__radix_tree_replace(root, node, slot, bitmap, NULL,
> +					     NULL);
> +		}
> +		__set_bit(bit, bitmap->bitmap);
> +		break;
> +	case XB_CLEAR:
> +		bitmap = __radix_tree_lookup(root, index, &node, &slot);
> +		if (radix_tree_exception(bitmap)) {
> +			tmp = (unsigned long)bitmap;
> +			if (ebit >= BITS_PER_LONG)
> +				return 0;
> +			tmp &= ~(1UL << ebit);
> +			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> +				__radix_tree_delete(root, node, slot);
> +			else
> +				rcu_assign_pointer(*slot, (void *)tmp);
> +			return 0;
> +		}
> +		if (!bitmap)
> +			return 0;
> +		__clear_bit(bit, bitmap->bitmap);
> +		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
> +			kfree(bitmap);
> +			__radix_tree_delete(root, node, slot);
> +		}
> +		break;
> +	case XB_TEST:
> +		bitmap = radix_tree_lookup(root, index);
> +		if (!bitmap)
> +			return 0;
> +		if (radix_tree_exception(bitmap)) {
> +			if (ebit > BITS_PER_LONG)
> +				return 0;
> +			return (unsigned long)bitmap & (1UL << bit);
> +		}
> +		ret = test_bit(bit, bitmap->bitmap);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return ret;
> +}

This is what I have the biggest problem with.  You've spliced
three functions together into a single 86-line function.  All that
they share is the first 11 lines of setup!  Go back and read
Documentation/process/coding-style.rst section 6 again.
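
As a sketch of the factoring being asked for, here is the XB_SET case above
pulled out into its own xb_set_bit(); it mirrors the quoted code rather than
adding anything new, with the shared setup simply repeated in the function:

int xb_set_bit(struct xb *xb, unsigned long bit)
{
	int err;
	unsigned long index = bit / IDA_BITMAP_BITS;
	struct radix_tree_root *root = &xb->xbrt;
	struct radix_tree_node *node;
	void **slot;
	struct ida_bitmap *bitmap;
	unsigned long ebit, tmp;

	bit %= IDA_BITMAP_BITS;
	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;

	err = __radix_tree_create(root, index, 0, &node, &slot);
	if (err)
		return err;
	bitmap = rcu_dereference_raw(*slot);
	if (radix_tree_exception(bitmap)) {
		tmp = (unsigned long)bitmap;
		if (ebit < BITS_PER_LONG) {
			tmp |= 1UL << ebit;
			rcu_assign_pointer(*slot, (void *)tmp);
			return 0;
		}
		bitmap = this_cpu_xchg(ida_bitmap, NULL);
		if (!bitmap)
			return -EAGAIN;
		memset(bitmap, 0, sizeof(*bitmap));
		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
		rcu_assign_pointer(*slot, bitmap);
	}
	if (!bitmap) {
		if (ebit < BITS_PER_LONG) {
			bitmap = (void *)((1UL << ebit) |
					RADIX_TREE_EXCEPTIONAL_ENTRY);
			__radix_tree_replace(root, node, slot, bitmap,
					     NULL, NULL);
			return 0;
		}
		bitmap = this_cpu_xchg(ida_bitmap, NULL);
		if (!bitmap)
			return -EAGAIN;
		memset(bitmap, 0, sizeof(*bitmap));
		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
	}
	__set_bit(bit, bitmap->bitmap);
	return 0;
}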


And you've just deleted the test suite.  Test suites are incredibly
important!  They keep us from regressing.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
  2017-08-28 10:08 ` [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
@ 2017-09-11 13:27   ` Matthew Wilcox
  2017-09-30  4:24     ` Wang, Wei W
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2017-09-11 13:27 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon, Aug 28, 2017 at 06:08:30PM +0800, Wei Wang wrote:
> +/**
> + *  xb_zero - zero a range of bits in the xbitmap
> + *  @xb: the xbitmap that the bits reside in
> + *  @start: the start of the range, inclusive
> + *  @end: the end of the range, inclusive
> + */
> +void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
> +{
> +	unsigned long i;
> +
> +	for (i = start; i <= end; i++)
> +		xb_clear_bit(xb, i);
> +}
> +EXPORT_SYMBOL(xb_zero);

Um.  This is not exactly going to be quick if you're clearing a range of bits.
I think it needs to be more along the lines of this:

void xb_clear(struct xb *xb, unsigned long start, unsigned long end) 
{ 
        struct radix_tree_root *root = &xb->xbrt; 
        struct radix_tree_node *node; 
        void **slot; 
        struct ida_bitmap *bitmap; 
 
        for (; start <= end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
                unsigned long index = start / IDA_BITMAP_BITS; 
                unsigned long bit = start % IDA_BITMAP_BITS; 

                bitmap = __radix_tree_lookup(root, index, &node, &slot);
                if (radix_tree_exception(bitmap)) {
                        unsigned long ebit = bit + 2;
                        unsigned long tmp = (unsigned long)bitmap;
                        if (ebit >= BITS_PER_LONG)
                                continue;
                        tmp &= ... something ...;
                        if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
                                __radix_tree_delete(root, node, slot);
                        else
                                rcu_assign_pointer(*slot, (void *)tmp);
                } else if (bitmap) {
                        unsigned int nbits = end - start + 1;
                        if (nbits + bit > IDA_BITMAP_BITS)
                                nbits = IDA_BITMAP_BITS - bit;
                        bitmap_clear(bitmap->bitmap, bit, nbits);
                        if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
                                kfree(bitmap);
                                __radix_tree_delete(root, node, slot);
                        }
                }
        }
}

Also note that this should be called xb_clear(), not xb_zero() to fit
in with bitmap_clear().  And this needs a thorough test suite testing
all values for 'start' and 'end' between 0 and at least 1024; probably
much higher.  And a variable number of bits need to be set before calling
xb_clear() in the test suite.

Also, this implementation above is missing a few tricks.  For example,
if 'bit' is 0 and 'nbits' == IDA_BITMAP_BITS, we can simply call kfree
without first zeroing out the bits and then checking if the whole thing
is zero.
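
Folded into the sketch above, that shortcut could be a small helper called
from the non-exceptional branch just before the bitmap_clear() (the helper
name here is made up for illustration):

/* Sketch: drop a fully-covered ida_bitmap without clearing it first. */
static bool xb_drop_full_bitmap(struct radix_tree_root *root,
				struct radix_tree_node *node, void **slot,
				struct ida_bitmap *bitmap,
				unsigned long bit, unsigned int nbits)
{
	if (bit != 0 || nbits != IDA_BITMAP_BITS)
		return false;
	kfree(bitmap);
	__radix_tree_delete(root, node, slot);
	return true;
}

Called as "if (xb_drop_full_bitmap(root, node, slot, bitmap, bit, nbits))
continue;" it lets the loop skip both the clearing and the emptiness check.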

Another missing optimisation above is that we always restart the radix
tree walk from the top instead of simply moving on to the next bitmap.
This is still a thousand times faster than the implementation you posted,
but I'd be keen to add that optimisation too.

> +/**
> + * xb_find_next_bit - find next 1 or 0 in the given range of bits
> + * @xb: the xbitmap that the bits reside in
> + * @start: the start of the range, inclusive
> + * @end: the end of the range, inclusive
> + * @set: the polarity (1 or 0) of the next bit to find
> + *
> + * Return the index of the found bit in the xbitmap. If the returned index
> + * exceeds @end, it indicates that no such bit is found in the given range.
> + */
> +unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
> +			       unsigned long end, bool set)
> +{
> +	unsigned long i;
> +
> +	for (i = start; i <= end; i++) {
> +		if (xb_test_bit(xb, i) == set)
> +			break;
> +	}
> +
> +	return i;
> +}
> +EXPORT_SYMBOL(xb_find_next_bit);

Similar comments ... this is going to be very slow.  You can use the
tags in the tree to help you find set and clear bits so performance
doesn't fall apart in big trees.

I'd like to see this be two functions, xb_find_next_zero_bit() and
xb_find_next_set_bit().
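
A rough sketch of the set-bit variant that at least scans chunk by chunk
rather than bit by bit is below. It does not use tags, so a tag-assisted walk
as suggested above would still do much better on sparse trees; the return
convention is kept as "greater than @end means not found":

unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
				   unsigned long end)
{
	struct radix_tree_root *root = &xb->xbrt;

	while (start <= end) {
		unsigned long index = start / IDA_BITMAP_BITS;
		unsigned long bit = start % IDA_BITMAP_BITS;
		struct ida_bitmap *bitmap = radix_tree_lookup(root, index);

		if (radix_tree_exception(bitmap)) {
			unsigned long tmp = (unsigned long)bitmap;
			unsigned long ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;

			if (ebit < BITS_PER_LONG) {
				tmp &= ~0UL << ebit;	/* ignore bits below @start */
				if (tmp)
					return index * IDA_BITMAP_BITS +
					       __ffs(tmp) -
					       RADIX_TREE_EXCEPTIONAL_SHIFT;
			}
		} else if (bitmap) {
			unsigned long pos = find_next_bit(bitmap->bitmap,
							  IDA_BITMAP_BITS, bit);

			if (pos < IDA_BITMAP_BITS)
				return index * IDA_BITMAP_BITS + pos;
		}
		/* Nothing set in this chunk; jump to the next ida_bitmap */
		start = (index + 1) * IDA_BITMAP_BITS;
	}
	return end + 1;
}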


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap
  2017-09-11 12:54   ` Matthew Wilcox
@ 2017-09-12 13:23     ` Wei Wang
  0 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2017-09-12 13:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu

On 09/11/2017 08:54 PM, Matthew Wilcox wrote:
> On Mon, Aug 28, 2017 at 06:08:29PM +0800, Wei Wang wrote:
>> From: Matthew Wilcox <mawilcox@microsoft.com>
>>
>> The eXtensible Bitmap is a sparse bitmap representation which is
>> efficient for set bits which tend to cluster.  It supports up to
>> 'unsigned long' worth of bits, and this commit adds the bare bones --
>> xb_set_bit(), xb_clear_bit() and xb_test_bit().
>>
>> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
> This is quite naughty of you.  You've modified the xbitmap implementation
> without any indication in the changelog that you did so.

This was changed in the previous version and noted in the
v13->v14 ChangeLog: https://lkml.org/lkml/2017/8/16/923


> I don't
> think the modifications you made are an improvement, but without any
> argumentation from you I don't know why you think they're an improvement.

Probably it shouldn't have been modified while the discussion was still open:
https://lkml.org/lkml/2017/8/10/36
Sorry about that. I hope we can get more feedback from you on the
changes later.

If you want, we can continue this part from the v13 patch, which might
be closer to the implementation you prefer:
https://lkml.org/lkml/2017/8/3/60

>> diff --git a/lib/xbitmap.c b/lib/xbitmap.c
>> new file mode 100644
>> index 0000000..8c55296
>> --- /dev/null
>> +++ b/lib/xbitmap.c
>> @@ -0,0 +1,176 @@
>> +#include <linux/slab.h>
>> +#include <linux/xbitmap.h>
>> +
>> +/*
>> + * The xbitmap implementation supports up to ULONG_MAX bits, and it is
>> + * implemented based on ida bitmaps. So, given an unsigned long index,
>> + * the high order XB_INDEX_BITS bits of the index is used to find the
>> + * corresponding item (i.e. ida bitmap) from the radix tree, and the low
>> + * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
>> + * the ida bitmap to find the bit.
>> + */
>> +#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
>> +#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
>> +					      RADIX_TREE_MAP_SHIFT))
>> +#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
> I don't understand why you moved the xb_preload code here from the
> radix tree.  I want all the code which touches the preload implementation
> together in one place, which is the radix tree.

Based on the previous comments (put all the code into lib/xbitmap.c) and your
comment here, I will move xb_preload() and the above macros to radix-tree.c,
while leaving the rest in xbitmap.c.

Would this be what you expected? Or would you prefer to move everything back
to radix-tree.c as in v13?


Best,
Wei


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-dev] Re: [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-08 11:09         ` [virtio-dev] " Wei Wang
@ 2017-09-29  4:01           ` Michael S. Tsirkin
  2017-09-29  6:55             ` Wei Wang
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2017-09-29  4:01 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Sep 08, 2017 at 07:09:24PM +0800, Wei Wang wrote:
> On 09/08/2017 11:36 AM, Michael S. Tsirkin wrote:
> > On Tue, Aug 29, 2017 at 11:09:18AM +0800, Wei Wang wrote:
> > > On 08/29/2017 02:03 AM, Michael S. Tsirkin wrote:
> > > > On Mon, Aug 28, 2017 at 06:08:31PM +0800, Wei Wang wrote:
> > > > > Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> > > > > of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> > > > > to the host.
> > > > > 
> > > > > The implementation of the previous virtio-balloon is not very
> > > > > efficient, because the balloon pages are transferred to the
> > > > > host one by one. Here is the breakdown of the time in percentage
> > > > > spent on each step of the balloon inflating process (inflating
> > > > > 7GB of an 8GB idle guest).
> > > > > 
> > > > > 1) allocating pages (6.5%)
> > > > > 2) sending PFNs to host (68.3%)
> > > > > 3) address translation (6.1%)
> > > > > 4) madvise (19%)
> > > > > 
> > > > > It takes about 4126ms for the inflating process to complete.
> > > > > The above profiling shows that the bottlenecks are stage 2)
> > > > > and stage 4).
> > > > > 
> > > > > This patch optimizes step 2) by transferring pages to the host in
> > > > > sgs. An sg describes a chunk of guest physically continuous pages.
> > > > > With this mechanism, step 4) can also be optimized by doing address
> > > > > translation and madvise() in chunks rather than page by page.
> > > > > 
> > > > > With this new feature, the above ballooning process takes ~597ms
> > > > > resulting in an improvement of ~86%.
> > > > > 
> > > > > TODO: optimize stage 1) by allocating/freeing a chunk of pages
> > > > > instead of a single page each time.
> > > > > 
> > > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > ---
> > > > >    drivers/virtio/virtio_balloon.c     | 171 ++++++++++++++++++++++++++++++++----
> > > > >    include/uapi/linux/virtio_balloon.h |   1 +
> > > > >    2 files changed, 155 insertions(+), 17 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > > > index f0b3a0b..8ecc1d4 100644
> > > > > --- a/drivers/virtio/virtio_balloon.c
> > > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > > @@ -32,6 +32,8 @@
> > > > >    #include <linux/mm.h>
> > > > >    #include <linux/mount.h>
> > > > >    #include <linux/magic.h>
> > > > > +#include <linux/xbitmap.h>
> > > > > +#include <asm/page.h>
> > > > >    /*
> > > > >     * Balloon device works in 4K page units.  So each page is pointed to by
> > > > > @@ -79,6 +81,9 @@ struct virtio_balloon {
> > > > >    	/* Synchronize access/update to this struct virtio_balloon elements */
> > > > >    	struct mutex balloon_lock;
> > > > > +	/* The xbitmap used to record balloon pages */
> > > > > +	struct xb page_xb;
> > > > > +
> > > > >    	/* The array of pfns we tell the Host about. */
> > > > >    	unsigned int num_pfns;
> > > > >    	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > > > @@ -141,13 +146,111 @@ static void set_page_pfns(struct virtio_balloon *vb,
> > > > >    					  page_to_balloon_pfn(page) + i);
> > > > >    }
> > > > > +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > > > +{
> > > > > +	struct scatterlist sg;
> > > > > +
> > > > > +	sg_init_one(&sg, addr, size);
> > > > > +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > > > > +}
> > > > > +
> > > > > +static void send_balloon_page_sg(struct virtio_balloon *vb,
> > > > > +				 struct virtqueue *vq,
> > > > > +				 void *addr,
> > > > > +				 uint32_t size,
> > > > > +				 bool batch)
> > > > > +{
> > > > > +	unsigned int len;
> > > > > +	int err;
> > > > > +
> > > > > +	err = add_one_sg(vq, addr, size);
> > > > > +	/* Sanity check: this can't really happen */
> > > > > +	WARN_ON(err);
> > > > It might be cleaner to detect that add failed due to
> > > > ring full and kick then. Just an idea, up to you
> > > > whether to do it.
> > > > 
> > > > > +
> > > > > +	/* If batching is in use, we batch the sgs till the vq is full. */
> > > > > +	if (!batch || !vq->num_free) {
> > > > > +		virtqueue_kick(vq);
> > > > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > > +		/* Release all the entries if there are */
> > > > Meaning
> > > > 	Account for all used entries if any
> > > > ?
> > > > 
> > > > > +		while (virtqueue_get_buf(vq, &len))
> > > > > +			;
> > > > Above code is reused below. Add a function?
> > > > 
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> > > > > + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > > > > + * The page xbitmap is searched for continuous "1" bits, which correspond
> > > > > + * to continuous pages, to chunk into sgs.
> > > > > + *
> > > > > + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> > > > > + * need to be searched.
> > > > > + */
> > > > > +static void tell_host_sgs(struct virtio_balloon *vb,
> > > > > +			  struct virtqueue *vq,
> > > > > +			  unsigned long page_xb_start,
> > > > > +			  unsigned long page_xb_end)
> > > > > +{
> > > > > +	unsigned long sg_pfn_start, sg_pfn_end;
> > > > > +	void *sg_addr;
> > > > > +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> > > > > +
> > > > > +	sg_pfn_start = page_xb_start;
> > > > > +	while (sg_pfn_start < page_xb_end) {
> > > > > +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> > > > > +						page_xb_end, 1);
> > > > > +		if (sg_pfn_start == page_xb_end + 1)
> > > > > +			break;
> > > > > +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> > > > > +					      page_xb_end, 0);
> > > > > +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> > > > > +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> > > > > +		while (sg_len > sg_max_len) {
> > > > > +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, 1);
> > > > Last argument should be true, not 1.
> > > > 
> > > > > +			sg_addr += sg_max_len;
> > > > > +			sg_len -= sg_max_len;
> > > > > +		}
> > > > > +		send_balloon_page_sg(vb, vq, sg_addr, sg_len, 1);
> > > > > +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> > > > > +		sg_pfn_start = sg_pfn_end + 1;
> > > > > +	}
> > > > > +
> > > > > +	/*
> > > > > +	 * The last few sgs may not reach the batch size, but need a kick to
> > > > > +	 * notify the device to handle them.
> > > > > +	 */
> > > > > +	if (vq->num_free != virtqueue_get_vring_size(vq)) {
> > > > > +		virtqueue_kick(vq);
> > > > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
> > > > > +		while (virtqueue_get_buf(vq, &sg_len))
> > > > > +			;
> > > > Some entries can get used after a pause. Looks like they will leak then?
> > > > One fix would be to convert above if to a while loop.
> > > > I don't know whether to do it like this in send_balloon_page_sg too.
> > > > 
> > > Thanks for the above comments. I've re-written this part of code.
> > > Please have a check below if there is anything more we could improve:
> > > 
> > > static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> > > {
> > >          unsigned int len;
> > > 
> > >          virtqueue_kick(vq);
> > >          wait_event(wq_head, virtqueue_get_buf(vq, &len));
> > >          /* Detach all the used buffers from the vq */
> > >          while (virtqueue_get_buf(vq, &len))
> > >                  ;
> > I would move this last part to before add_buf. Increases chances
> > it succeeds even in case of a bug.
> 
> > 
> > > }
> > > 
> > > static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > {
> > >          struct scatterlist sg;
> > >          int ret;
> > > 
> > >          sg_init_one(&sg, addr, size);
> > >          ret = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > >          if (unlikely(ret == -ENOSPC))
> > >                  dev_warn(&vq->vdev->dev, "%s: failed due to ring full\n",
> > >                                   __func__);
> > So if this ever triggers then kick and wait might fail, right?
> > I think you should not special-case this one then.
> 
> OK, I will remove the check above, and take other suggestions as well.
> Thanks.
> 
> Best,
> Wei

Any updates here? It's been a while.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-dev] Re: [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-29  4:01           ` Michael S. Tsirkin
@ 2017-09-29  6:55             ` Wei Wang
  0 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2017-09-29  6:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 09/29/2017 12:01 PM, Michael S. Tsirkin wrote:
> On Fri, Sep 08, 2017 at 07:09:24PM +0800, Wei Wang wrote:
>> On 09/08/2017 11:36 AM, Michael S. Tsirkin wrote:
>>> On Tue, Aug 29, 2017 at 11:09:18AM +0800, Wei Wang wrote:
>>>> On 08/29/2017 02:03 AM, Michael S. Tsirkin wrote:
>>>>> On Mon, Aug 28, 2017 at 06:08:31PM +0800, Wei Wang wrote:
>>>>>> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
>>>>>> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
>>>>>> to the host.
>>>>>>
>>>>>> The implementation of the previous virtio-balloon is not very
>>>>>> efficient, because the balloon pages are transferred to the
>>>>>> host one by one. Here is the breakdown of the time in percentage
>>>>>> spent on each step of the balloon inflating process (inflating
>>>>>> 7GB of an 8GB idle guest).
>>>>>>
>>>>>> 1) allocating pages (6.5%)
>>>>>> 2) sending PFNs to host (68.3%)
>>>>>> 3) address translation (6.1%)
>>>>>> 4) madvise (19%)
>>>>>>
>>>>>> It takes about 4126ms for the inflating process to complete.
>>>>>> The above profiling shows that the bottlenecks are stage 2)
>>>>>> and stage 4).
>>>>>>
>>>>>> This patch optimizes step 2) by transferring pages to the host in
>>>>>> sgs. An sg describes a chunk of guest physically continuous pages.
>>>>>> With this mechanism, step 4) can also be optimized by doing address
>>>>>> translation and madvise() in chunks rather than page by page.
>>>>>>
>>>>>> With this new feature, the above ballooning process takes ~597ms
>>>>>> resulting in an improvement of ~86%.
>>>>>>
>>>>>> TODO: optimize stage 1) by allocating/freeing a chunk of pages
>>>>>> instead of a single page each time.
>>>>>>
>>>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>>>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>> ---
>>>>>>     drivers/virtio/virtio_balloon.c     | 171 ++++++++++++++++++++++++++++++++----
>>>>>>     include/uapi/linux/virtio_balloon.h |   1 +
>>>>>>     2 files changed, 155 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>>>> index f0b3a0b..8ecc1d4 100644
>>>>>> --- a/drivers/virtio/virtio_balloon.c
>>>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>>>> @@ -32,6 +32,8 @@
>>>>>>     #include <linux/mm.h>
>>>>>>     #include <linux/mount.h>
>>>>>>     #include <linux/magic.h>
>>>>>> +#include <linux/xbitmap.h>
>>>>>> +#include <asm/page.h>
>>>>>>     /*
>>>>>>      * Balloon device works in 4K page units.  So each page is pointed to by
>>>>>> @@ -79,6 +81,9 @@ struct virtio_balloon {
>>>>>>     	/* Synchronize access/update to this struct virtio_balloon elements */
>>>>>>     	struct mutex balloon_lock;
>>>>>> +	/* The xbitmap used to record balloon pages */
>>>>>> +	struct xb page_xb;
>>>>>> +
>>>>>>     	/* The array of pfns we tell the Host about. */
>>>>>>     	unsigned int num_pfns;
>>>>>>     	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
>>>>>> @@ -141,13 +146,111 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>>>>>     					  page_to_balloon_pfn(page) + i);
>>>>>>     }
>>>>>> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>>>> +{
>>>>>> +	struct scatterlist sg;
>>>>>> +
>>>>>> +	sg_init_one(&sg, addr, size);
>>>>>> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
>>>>>> +}
>>>>>> +
>>>>>> +static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>>>> +				 struct virtqueue *vq,
>>>>>> +				 void *addr,
>>>>>> +				 uint32_t size,
>>>>>> +				 bool batch)
>>>>>> +{
>>>>>> +	unsigned int len;
>>>>>> +	int err;
>>>>>> +
>>>>>> +	err = add_one_sg(vq, addr, size);
>>>>>> +	/* Sanity check: this can't really happen */
>>>>>> +	WARN_ON(err);
>>>>> It might be cleaner to detect that add failed due to
>>>>> ring full and kick then. Just an idea, up to you
>>>>> whether to do it.
>>>>>
>>>>>> +
>>>>>> +	/* If batching is in use, we batch the sgs till the vq is full. */
>>>>>> +	if (!batch || !vq->num_free) {
>>>>>> +		virtqueue_kick(vq);
>>>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>>>> +		/* Release all the entries if there are */
>>>>> Meaning
>>>>> 	Account for all used entries if any
>>>>> ?
>>>>>
>>>>>> +		while (virtqueue_get_buf(vq, &len))
>>>>>> +			;
>>>>> Above code is reused below. Add a function?
>>>>>
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>>>> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>>>> + * The page xbitmap is searched for continuous "1" bits, which correspond
>>>>>> + * to continuous pages, to chunk into sgs.
>>>>>> + *
>>>>>> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
>>>>>> + * need to be searched.
>>>>>> + */
>>>>>> +static void tell_host_sgs(struct virtio_balloon *vb,
>>>>>> +			  struct virtqueue *vq,
>>>>>> +			  unsigned long page_xb_start,
>>>>>> +			  unsigned long page_xb_end)
>>>>>> +{
>>>>>> +	unsigned long sg_pfn_start, sg_pfn_end;
>>>>>> +	void *sg_addr;
>>>>>> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
>>>>>> +
>>>>>> +	sg_pfn_start = page_xb_start;
>>>>>> +	while (sg_pfn_start < page_xb_end) {
>>>>>> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
>>>>>> +						page_xb_end, 1);
>>>>>> +		if (sg_pfn_start == page_xb_end + 1)
>>>>>> +			break;
>>>>>> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
>>>>>> +					      page_xb_end, 0);
>>>>>> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
>>>>>> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
>>>>>> +		while (sg_len > sg_max_len) {
>>>>>> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, 1);
>>>>> Last argument should be true, not 1.
>>>>>
>>>>>> +			sg_addr += sg_max_len;
>>>>>> +			sg_len -= sg_max_len;
>>>>>> +		}
>>>>>> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len, 1);
>>>>>> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
>>>>>> +		sg_pfn_start = sg_pfn_end + 1;
>>>>>> +	}
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * The last few sgs may not reach the batch size, but need a kick to
>>>>>> +	 * notify the device to handle them.
>>>>>> +	 */
>>>>>> +	if (vq->num_free != virtqueue_get_vring_size(vq)) {
>>>>>> +		virtqueue_kick(vq);
>>>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &sg_len));
>>>>>> +		while (virtqueue_get_buf(vq, &sg_len))
>>>>>> +			;
>>>>> Some entries can get used after a pause. Looks like they will leak then?
>>>>> One fix would be to convert above if to a while loop.
>>>>> I don't know whether to do it like this in send_balloon_page_sg too.
>>>>>
>>>> Thanks for the above comments. I've re-written this part of code.
>>>> Please have a check below if there is anything more we could improve:
>>>>
>>>> static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
>>>> {
>>>>           unsigned int len;
>>>>
>>>>           virtqueue_kick(vq);
>>>>           wait_event(wq_head, virtqueue_get_buf(vq, &len));
>>>>           /* Detach all the used buffers from the vq */
>>>>           while (virtqueue_get_buf(vq, &len))
>>>>                   ;
>>> I would move this last part to before add_buf. Increases chances
>>> it succeeds even in case of a bug.
>>>> }
>>>>
>>>> static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> {
>>>>           struct scatterlist sg;
>>>>           int ret;
>>>>
>>>>           sg_init_one(&sg, addr, size);
>>>>           ret = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
>>>>           if (unlikely(ret == -ENOSPC))
>>>>                   dev_warn(&vq->vdev->dev, "%s: failed due to ring full\n",
>>>>                                    __func__);
>>> So if this ever triggers then kick and wait might fail, right?
>>> I think you should not special-case this one then.
>> OK, I will remove the check above, and take other suggestions as well.
>> Thanks.
>>
>> Best,
>> Wei
> Any updates here? It's been a while.
>

Yes. With some major optimizations to the xbitmap, we can improve the
ballooning time to ~492ms.
I will send out the patches soon.

Best,
Wei



^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
  2017-09-11 13:27   ` Matthew Wilcox
@ 2017-09-30  4:24     ` Wang, Wei W
  0 siblings, 0 replies; 20+ messages in thread
From: Wang, Wei W @ 2017-09-30  4:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu

On Monday, September 11, 2017 9:27 PM, Matthew Wilcox wrote
> On Mon, Aug 28, 2017 at 06:08:30PM +0800, Wei Wang wrote:
> > +/**
> > + *  xb_zero - zero a range of bits in the xbitmap
> > + *  @xb: the xbitmap that the bits reside in
> > + *  @start: the start of the range, inclusive
> > + *  @end: the end of the range, inclusive  */ void xb_zero(struct xb
> > +*xb, unsigned long start, unsigned long end) {
> > +	unsigned long i;
> > +
> > +	for (i = start; i <= end; i++)
> > +		xb_clear_bit(xb, i);
> > +}
> > +EXPORT_SYMBOL(xb_zero);
> 
> Um.  This is not exactly going to be quick if you're clearing a range of bits.
> I think it needs to be more along the lines of this:
> 
> void xb_clear(struct xb *xb, unsigned long start, unsigned long end) {
>         struct radix_tree_root *root = &xb->xbrt;
>         struct radix_tree_node *node;
>         void **slot;
>         struct ida_bitmap *bitmap;
> 
>         for (; start <= end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
>                 unsigned long index = start / IDA_BITMAP_BITS;
>                 unsigned long bit = start % IDA_BITMAP_BITS;
> 
>                 bitmap = __radix_tree_lookup(root, index, &node, &slot);
>                 if (radix_tree_exception(bitmap)) {
>                         unsigned long ebit = bit + 2;
>                         unsigned long tmp = (unsigned long)bitmap;
>                         if (ebit >= BITS_PER_LONG)
>                                 continue;
>                         tmp &= ... something ...;
>                         if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
>                                 __radix_tree_delete(root, node, slot);
>                         else
>                                 rcu_assign_pointer(*slot, (void *)tmp);
>                 } else if (bitmap) {
>                         unsigned int nbits = end - start + 1;
>                         if (nbits + bit > IDA_BITMAP_BITS)
>                                 nbits = IDA_BITMAP_BITS - bit;
>                         bitmap_clear(bitmap->bitmap, bit, nbits);
>                         if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
>                                 kfree(bitmap);
>                                 __radix_tree_delete(root, node, slot);
>                         }
>                 }
>         }
> }
> 
> Also note that this should be called xb_clear(), not xb_zero() to fit in with
> bitmap_clear().  And this needs a thorough test suite testing all values for 'start'
> and 'end' between 0 and at least 1024; probably much higher.  And a variable
> number of bits need to be set before calling
> xb_clear() in the test suite.
> 
> Also, this implementation above is missing a few tricks.  For example, if 'bit' is 0
> and 'nbits' == IDA_BITMAP_BITS, we can simply call kfree without first zeroing
> out the bits and then checking if the whole thing is zero.

Thanks for the optimization suggestions. We've seen a significant improvement in
the ballooning time. Some other optimizations (noted in the changelog) haven't
been included in the new version. If possible, we can leave those to a second
optimization step outside this patch series.

Best,
Wei


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-09-30  4:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-28 10:08 [PATCH v15 0/5] Virtio-balloon Enhancement Wei Wang
2017-08-28 10:08 ` [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
2017-09-11 12:54   ` Matthew Wilcox
2017-09-12 13:23     ` Wei Wang
2017-08-28 10:08 ` [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
2017-09-11 13:27   ` Matthew Wilcox
2017-09-30  4:24     ` Wang, Wei W
2017-08-28 10:08 ` [PATCH v15 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
2017-08-28 18:03   ` Michael S. Tsirkin
2017-08-29  3:09     ` Wei Wang
2017-09-08  3:36       ` Michael S. Tsirkin
2017-09-08 11:09         ` [virtio-dev] " Wei Wang
2017-09-29  4:01           ` Michael S. Tsirkin
2017-09-29  6:55             ` Wei Wang
2017-08-28 10:08 ` [PATCH v15 4/5] mm: support reporting free page blocks Wei Wang
2017-08-28 13:33   ` Michal Hocko
2017-08-28 14:09     ` Michal Hocko
2017-08-29  3:23     ` Wei Wang
2017-08-28 10:08 ` [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ Wei Wang
2017-09-05 11:57   ` Wang, Wei W

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).