* [PATCH v18 00/10] Virtio-balloon Enhancement
@ 2017-11-29 13:55 Wei Wang
  2017-11-29 13:55 ` [PATCH v18 01/10] idr: add #include <linux/bug.h> Wei Wang
                   ` (9 more replies)
  0 siblings, 10 replies; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one array each time; and
2) free page block reporting: a new virtqueue to report guest free pages
to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
commonly used by the hypervisor to track which parts of memory are written
is to write-protect all the guest memory.

The second feature enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. There is no concern if the guest reuses these pages after they
are reported to the hypervisor as free-page hints, because any page that is
written afterwards will be caught by the hypervisor's dirty tracking and
transferred in a later round.
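
For illustration only (this is not code from this series), here is a minimal
sketch of how a hypervisor's first-round copy loop could consume such hints;
the struct vm type and the helpers used here are hypothetical:

static void migrate_first_round(struct vm *vm)
{
	unsigned long pfn;

	for (pfn = 0; pfn < vm->nr_pages; pfn++) {
		/*
		 * Skip pages the guest reported as free. If the guest later
		 * reuses and writes such a page, dirty tracking (e.g. write
		 * protection) catches it and the page is sent in a later
		 * round, so skipping it here is safe.
		 */
		if (free_page_hinted(vm, pfn))
			continue;

		send_page(vm, pfn);
	}
}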

ChangeLog:
v17->v18:
1) patch 1-2: new patches to solve some tools-related compilation issues
2) patch 3: revert to the original xbitmap implementation from Matthew
Wilcox with some minor changes (e.g. comments added to the exported
functions)
3) patch 4: summarize the changes we want to make to patch 3
4) patch 5: add the developer notes as a reminder for users to avoid
concurrent accesses to the ida bitmap
5) patch 6: a new vring API to allow users to directly pass in a physical
address to a vring desc
6) patch 7: ballooning time is reduced from ~490ms to ~440ms with the new
implementation
	- use the new API from patch 6 to send balloon pages
	- xb_preload with "GFP_NOWAIT | __GFP_NOWARN" flag;
	- handle the case when xb_set_page() fails to avoid memory leak;
	- put xb_set_page() under the balloon lock
7) patch 9: simpler implementation
	- start free page reporting by sending a new cmd id from the host
	- guest acks the start or stop via adding a cmd id to the free page vq
	- use vb->report_free_page, instead of vb->report_free_page_stop
	- use WRITE_ONCE/READ_ONCE to access vb->report_free_page
	- use the new API from patch 6 to send free pages to avoid the
	  unnecessary use of kaddr.
8) patch 10: new patch to solve the page poisoning issue reported by
Michael S. Tsirkin

v16->v17:
1) patch 1: please check the commit log there;
2) patch 3: included Michael S. Tsirkin patch to fix the potential
deadlock issue;
3) patch 4: use BUG_ON if virtqueue_add_ returns an error, which is
expected never to happen;
4) patch 4: add leak_balloon_sg_oom, which is used in the oom case when
VIRTIO_BALLOON_F_SG is in use;
5) patch 6: use config registers, instead of a vq, as the command channel
between the host and guest;
6) patch 6: add the command sequence id support.

v15->v16:
1) mm: stop reporting the free pfn range if the callback returns false;
2) mm: move some implementation of walk_free_mem_block into a function to
make the code layout look better;
3) xbitmap: added some optimizations suggested by Matthew, please refer to
the ChangeLog in the xbitmap patch for details.
4) xbitmap: added a test suite
5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf returns
an error
6) virtio-balloon: some small code re-arrangement, e.g. detaching the used
buf from the vq before adding a new buf

v14->v15:
1) mm: make the report callback return a bool value - returning true to
stop walking through the free page list.
2) virtio-balloon: batching sgs of balloon pages till the vq is full
3) virtio-balloon: create a new workqueue, rather than using the default
system_wq, to queue the free page reporting work item.
4) virtio-balloon: add a ctrl_vq to be a central control plane which will
handle all the future control related commands between the host and guest.
Add free page report as the first feature controlled under ctrl_vq, and
the free_page_vq is a data plane vq dedicated to the transmission of free
page blocks.

v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page block to the device, if the
vq is full (less likely to happen in practice), just skip reporting this
block, instead of busy-waiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the free page blocks from the
report function. This avoids exposing the zone internals to a kernel
module.
2) virtio-balloon: send balloon pages or a free page block using a single
sg each time. This has the benefits of simpler implementation with no new
APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input
sgs, and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring
desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE()
macro to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages
blocks directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3) virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (1):
  xbitmap: Introduce xbitmap

Wei Wang (9):
  idr: add #include <linux/bug.h>
  radix tree test suite: remove ARRAY_SIZE to avoid redefinition
  xbitmap: potential improvement
  xbitmap: add more operations
  virtio_ring: add a new API, virtqueue_add_one_desc
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  virtio-balloon: don't report free pages when page poisoning is enabled

 drivers/virtio/virtio_balloon.c          | 418 +++++++++++++++++++++++++++----
 drivers/virtio/virtio_ring.c             |  94 +++++--
 include/linux/idr.h                      |   1 +
 include/linux/mm.h                       |   6 +
 include/linux/radix-tree.h               |   2 +
 include/linux/virtio.h                   |   6 +
 include/linux/xbitmap.h                  |  55 ++++
 include/uapi/linux/virtio_balloon.h      |   5 +
 lib/Makefile                             |   2 +-
 lib/radix-tree.c                         |  41 ++-
 lib/xbitmap.c                            | 361 ++++++++++++++++++++++++++
 mm/page_alloc.c                          |  91 +++++++
 tools/include/linux/bitmap.h             |  34 +++
 tools/include/linux/kernel.h             |   2 +
 tools/testing/radix-tree/Makefile        |  12 +-
 tools/testing/radix-tree/linux/kernel.h  |   2 -
 tools/testing/radix-tree/linux/xbitmap.h |   1 +
 tools/testing/radix-tree/main.c          |   4 +
 tools/testing/radix-tree/test.h          |   1 +
 19 files changed, 1063 insertions(+), 75 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c
 create mode 100644 tools/testing/radix-tree/linux/xbitmap.h

-- 
2.7.4

* [PATCH v18 01/10] idr: add #include <linux/bug.h>
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-30  0:58   ` Matthew Wilcox
  2017-11-29 13:55 ` [PATCH v18 02/10] radix tree test suite: remove ARRAY_SIZE to avoid redefinition Wei Wang
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

The <linux/bug.h> include was removed from radix-tree.h by commit
f5bba9d11a256ad2a1c2f8e7fc6aabe6416b7890.

Since that commit, tools/testing/radix-tree/ has failed to compile due to:
tools/testing/radix-tree/idr.c:17: undefined reference to WARN_ON_ONCE.
This patch adds the bug.h header to idr.h to solve the issue.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
---
 include/linux/idr.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 7c3a365..fa14f83 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -15,6 +15,7 @@
 #include <linux/radix-tree.h>
 #include <linux/gfp.h>
 #include <linux/percpu.h>
+#include <linux/bug.h>
 
 struct idr {
 	struct radix_tree_root	idr_rt;
-- 
2.7.4

* [PATCH v18 02/10] radix tree test suite: remove ARRAY_SIZE to avoid redefinition
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
  2017-11-29 13:55 ` [PATCH v18 01/10] idr: add #include <linux/bug.h> Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-29 13:55 ` [PATCH v18 03/10] xbitmap: Introduce xbitmap Wei Wang
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

ARRAY_SIZE() is already defined in include/linux/kernel.h, so "make"
warns about the redefinition of ARRAY_SIZE() in
tools/testing/radix-tree/linux/kernel.h. Remove ARRAY_SIZE() from there.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 tools/testing/radix-tree/linux/kernel.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index c3bc3f3..426f32f 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -17,6 +17,4 @@
 #define pr_debug printk
 #define pr_cont printk
 
-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
-
 #endif /* _KERNEL_H */
-- 
2.7.4

* [PATCH v18 03/10] xbitmap: Introduce xbitmap
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
  2017-11-29 13:55 ` [PATCH v18 01/10] idr: add #include <linux/bug.h> Wei Wang
  2017-11-29 13:55 ` [PATCH v18 02/10] radix tree test suite: remove ARRAY_SIZE to avoid redefinition Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-29 13:55 ` [PATCH v18 04/10] xbitmap: potential improvement Wei Wang
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
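
For illustration, a minimal usage sketch of the bare-bones API added here
(it mirrors the user-space test at the bottom of this patch; xb_preload()
and xb_preload_end() bracket the bit-setting because setting a bit may need
to allocate memory):

#include <linux/xbitmap.h>

static DEFINE_XB(my_xb);

static void xb_example(unsigned long bit)
{
	xb_preload(GFP_KERNEL);		/* preallocate; disables preemption */
	xb_set_bit(&my_xb, bit);	/* 0 on success, -EAGAIN/-ENOMEM on error */
	xb_preload_end();		/* re-enables preemption */

	if (xb_test_bit(&my_xb, bit))
		xb_clear_bit(&my_xb, bit);
}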

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 include/linux/radix-tree.h               |   2 +
 include/linux/xbitmap.h                  |  52 +++++++++
 lib/Makefile                             |   2 +-
 lib/radix-tree.c                         |  26 ++++-
 lib/xbitmap.c                            | 179 +++++++++++++++++++++++++++++++
 tools/testing/radix-tree/Makefile        |  12 ++-
 tools/testing/radix-tree/linux/xbitmap.h |   1 +
 tools/testing/radix-tree/main.c          |   4 +
 tools/testing/radix-tree/test.h          |   1 +
 9 files changed, 275 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c
 create mode 100644 tools/testing/radix-tree/linux/xbitmap.h

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 23a9c89..fe44f4b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -315,6 +315,8 @@ void radix_tree_iter_delete(struct radix_tree_root *,
 			struct radix_tree_iter *iter, void __rcu **slot);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+bool __radix_tree_delete(struct radix_tree_root *r, struct radix_tree_node *n,
+				void **slot);
 void radix_tree_clear_tags(struct radix_tree_root *, struct radix_tree_node *,
 			   void __rcu **slot);
 unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..ed75d87
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,52 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+int xb_clear_bit(struct xb *xb, unsigned long bit);
+
+int xb_zero(struct xb *xb, unsigned long start, unsigned long nbits);
+int xb_fill(struct xb *xb, unsigned long start, unsigned long nbits);
+
+static inline bool xb_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t);
+
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
diff --git a/lib/Makefile b/lib/Makefile
index d11c48e..08a8183 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -19,7 +19,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c8d5556..7000ad6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,14 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The XB can go up to unsigned long, but also uses a bitmap.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+						RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -839,6 +847,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1982,7 +1992,7 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
+bool __radix_tree_delete(struct radix_tree_root *root,
 				struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
@@ -2135,6 +2145,20 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
 }
 EXPORT_SYMBOL(ida_pre_get);
 
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
 void __rcu **idr_get_free_cmn(struct radix_tree_root *root,
 			      struct radix_tree_iter *iter, gfp_t gfp,
 			      unsigned long max)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..2b547a73
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,179 @@
+#include <linux/export.h>
+#include <linux/xbitmap.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, the per-cpu ida_bitmap will be taken.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+int xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return 0;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return 0;
+	}
+
+	if (!bitmap)
+		return 0;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ *
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit > BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+#ifndef __KERNEL__
+
+static DEFINE_XB(xb1);
+
+void xbitmap_check_bit(unsigned long bit)
+{
+	xb_preload(GFP_KERNEL);
+	assert(!xb_test_bit(&xb1, bit));
+	assert(xb_set_bit(&xb1, bit) == 0);
+	assert(xb_test_bit(&xb1, bit));
+	assert(xb_clear_bit(&xb1, bit) == 0);
+	assert(xb_empty(&xb1));
+	assert(xb_clear_bit(&xb1, bit) == 0);
+	assert(xb_empty(&xb1));
+	xb_preload_end();
+}
+
+void xbitmap_checks(void)
+{
+	xb_init(&xb1);
+	xbitmap_check_bit(0);
+	xbitmap_check_bit(30);
+	xbitmap_check_bit(31);
+	xbitmap_check_bit(1023);
+	xbitmap_check_bit(1024);
+	xbitmap_check_bit(1025);
+	xbitmap_check_bit((1UL << 63) | (1UL << 24));
+	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+}
+
+int __weak main(void)
+{
+	radix_tree_init();
+	xbitmap_checks();
+}
+#endif
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index fa7ee36..34ece78 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -6,7 +6,8 @@ LDLIBS+= -lpthread -lurcu
 TARGETS = main idr-test multiorder
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
-	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+	 xbitmap.o
 
 ifndef SHIFT
 	SHIFT=3
@@ -25,8 +26,11 @@ idr-test: idr-test.o $(CORE_OFILES)
 
 multiorder: multiorder.o $(CORE_OFILES)
 
+xbitmap: xbitmap.o $(CORE_OFILES)
+	$(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
+
 clean:
-	$(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h
+	$(RM) $(TARGETS) *.o radix-tree.c idr.c xbitmap.c generated/map-shift.h
 
 vpath %.c ../../lib
 
@@ -34,6 +38,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
 	../../include/linux/*.h \
 	../../include/asm/*.h \
 	../../../include/linux/radix-tree.h \
+	../../../include/linux/xbitmap.h \
 	../../../include/linux/idr.h
 
 radix-tree.c: ../../../lib/radix-tree.c
@@ -42,6 +47,9 @@ radix-tree.c: ../../../lib/radix-tree.c
 idr.c: ../../../lib/idr.c
 	sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
 
+xbitmap.c: ../../../lib/xbitmap.c
+	sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
+
 .PHONY: mapshift
 
 mapshift:
diff --git a/tools/testing/radix-tree/linux/xbitmap.h b/tools/testing/radix-tree/linux/xbitmap.h
new file mode 100644
index 0000000..61de214
--- /dev/null
+++ b/tools/testing/radix-tree/linux/xbitmap.h
@@ -0,0 +1 @@
+#include "../../../../include/linux/xbitmap.h"
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index 257f3f8..d112363 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -326,6 +326,10 @@ static void single_thread_tests(bool long_run)
 	rcu_barrier();
 	printv(2, "after idr_checks: %d allocated, preempt %d\n",
 		nr_allocated, preempt_count);
+	xbitmap_checks();
+	rcu_barrier();
+	printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+			nr_allocated, preempt_count);
 	big_gang_check(long_run);
 	rcu_barrier();
 	printv(2, "after big_gang_check: %d allocated, preempt %d\n",
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index d9c031d..8175d6b 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -38,6 +38,7 @@ void benchmark(void);
 void idr_checks(void);
 void ida_checks(void);
 void ida_thread_tests(void);
+void xbitmap_checks(void);
 
 struct item *
 item_tag_set(struct radix_tree_root *root, unsigned long index, int tag);
-- 
2.7.4

* [PATCH v18 04/10] xbitmap: potential improvement
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
                   ` (2 preceding siblings ...)
  2017-11-29 13:55 ` [PATCH v18 03/10] xbitmap: Introduce xbitmap Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-29 13:55 ` [PATCH v18 05/10] xbitmap: add more operations Wei Wang
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

This patch makes some changes to the original xbitmap implementation from
the linux-dax tree:

- remove xb_fill() and xb_zero() from xbitmap.h since they are not
  implemented;

- xb_test_bit: change "ebit > BITS_PER_LONG" to "ebit >= BITS_PER_LONG",
  because bit 64 is beyond the "unsigned long" exceptional entry (bits 0
  to 63);

- xb_set_bit: delete the newly inserted radix_tree_node when failing to
  get the per-CPU ida bitmap; this avoids leaking the unused radix tree
  node left in the tree.

- xb_clear_bit: change it to a void function, since the original
  implementation returns nothing other than 0.

- remove the comment above "#define XB_INDEX_BITS", because it caused
  confusion based on the feedback from the previous discussion;

- xb_preload: with the original implementation, the CPU that successfully
  does __radix_tree_preload() may sleep in kmalloc(), which risks the
  caller of xb_preload() being scheduled to another CPU after it is woken
  up, and the new CPU may not have a radix_tree_node pre-allocated, which
  will be a problem when inserting a node into the tree later. This patch
  moves __radix_tree_preload() after kmalloc() and returns a boolean to
  indicate success or failure. Also, add the __must_check annotation to
  xb_preload() for prudence.
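
As a usage sketch of the contract described above (this is essentially what
a later patch in this series wraps as xb_preload_and_set_bit()):

static int example_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp)
{
	int ret;

	/* Returns false on allocation failure, with preemption still enabled. */
	if (!xb_preload(gfp))
		return -ENOMEM;

	/* Preemption is disabled from here until xb_preload_end(). */
	ret = xb_set_bit(xb, bit);	/* may still return -EAGAIN, see above */
	xb_preload_end();

	return ret;
}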

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/xbitmap.h |  5 +----
 lib/radix-tree.c        | 27 +++++++++++++++++++++------
 lib/xbitmap.c           | 24 +++++++++++++-----------
 3 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index ed75d87..b4d8375 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -36,15 +36,12 @@ int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 int xb_clear_bit(struct xb *xb, unsigned long bit);
 
-int xb_zero(struct xb *xb, unsigned long start, unsigned long nbits);
-int xb_fill(struct xb *xb, unsigned long start, unsigned long nbits);
-
 static inline bool xb_empty(const struct xb *xb)
 {
 	return radix_tree_empty(&xb->xbrt);
 }
 
-void xb_preload(gfp_t);
+bool xb_preload(gfp_t);
 
 static inline void xb_preload_end(void)
 {
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7000ad6..a039588 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -77,9 +77,6 @@ static struct kmem_cache *radix_tree_node_cachep;
 						RADIX_TREE_MAP_SHIFT))
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
-/*
- * The XB can go up to unsigned long, but also uses a bitmap.
- */
 #define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
 #define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
 						RADIX_TREE_MAP_SHIFT))
@@ -2145,17 +2142,35 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
 }
 EXPORT_SYMBOL(ida_pre_get);
 
-void xb_preload(gfp_t gfp)
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). On success,
+ * return true, with preemption disabled. On error, return false with
+ * preemption not disabled.
+ */
+__must_check bool xb_preload(gfp_t gfp)
 {
-	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
 	if (!this_cpu_read(ida_bitmap)) {
 		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
 
 		if (!bitmap)
-			return;
+			return false;
+		/*
+		 * The per-CPU variable is updated with preemption enabled.
+		 * If the calling task is unlucky to be scheduled to another
+		 * CPU which has no ida_bitmap allocation, it will be detected
+		 * when setting a bit (i.e. __xb_set_bit()).
+		 */
 		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
 		kfree(bitmap);
 	}
+
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		return false;
+
+	return true;
 }
 EXPORT_SYMBOL(xb_preload);
 
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 2b547a73..182aa29 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -39,8 +39,10 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
 			return 0;
 		}
 		bitmap = this_cpu_xchg(ida_bitmap, NULL);
-		if (!bitmap)
+		if (!bitmap) {
+			__radix_tree_delete(root, node, slot);
 			return -EAGAIN;
+		}
 		memset(bitmap, 0, sizeof(*bitmap));
 		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
 		rcu_assign_pointer(*slot, bitmap);
@@ -54,8 +56,10 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
 			return 0;
 		}
 		bitmap = this_cpu_xchg(ida_bitmap, NULL);
-		if (!bitmap)
+		if (!bitmap) {
+			__radix_tree_delete(root, node, slot);
 			return -EAGAIN;
+		}
 		memset(bitmap, 0, sizeof(*bitmap));
 		__radix_tree_replace(root, node, slot, bitmap, NULL);
 	}
@@ -73,7 +77,7 @@ EXPORT_SYMBOL(xb_set_bit);
  * This function is used to clear a bit in the xbitmap. If all the bits of the
  * bitmap are 0, the bitmap will be freed.
  */
-int xb_clear_bit(struct xb *xb, unsigned long bit)
+void xb_clear_bit(struct xb *xb, unsigned long bit)
 {
 	unsigned long index = bit / IDA_BITMAP_BITS;
 	struct radix_tree_root *root = &xb->xbrt;
@@ -90,25 +94,23 @@ int xb_clear_bit(struct xb *xb, unsigned long bit)
 		unsigned long tmp = (unsigned long)bitmap;
 
 		if (ebit >= BITS_PER_LONG)
-			return 0;
+			return;
 		tmp &= ~(1UL << ebit);
 		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
 			__radix_tree_delete(root, node, slot);
 		else
 			rcu_assign_pointer(*slot, (void *)tmp);
-		return 0;
+		return;
 	}
 
 	if (!bitmap)
-		return 0;
+		return;
 
 	__clear_bit(bit, bitmap->bitmap);
 	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
 		kfree(bitmap);
 		__radix_tree_delete(root, node, slot);
 	}
-
-	return 0;
 }
 EXPORT_SYMBOL(xb_clear_bit);
 
@@ -133,7 +135,7 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit)
 		return false;
 	if (radix_tree_exception(bitmap)) {
 		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
-		if (bit > BITS_PER_LONG)
+		if (bit >= BITS_PER_LONG)
 			return false;
 		return (unsigned long)bitmap & (1UL << bit);
 	}
@@ -151,9 +153,9 @@ void xbitmap_check_bit(unsigned long bit)
 	assert(!xb_test_bit(&xb1, bit));
 	assert(xb_set_bit(&xb1, bit) == 0);
 	assert(xb_test_bit(&xb1, bit));
-	assert(xb_clear_bit(&xb1, bit) == 0);
+	xb_clear_bit(&xb1, bit);
 	assert(xb_empty(&xb1));
-	assert(xb_clear_bit(&xb1, bit) == 0);
+	xb_clear_bit(&xb1, bit);
 	assert(xb_empty(&xb1));
 	xb_preload_end();
 }
-- 
2.7.4

* [PATCH v18 05/10] xbitmap: add more operations
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
                   ` (3 preceding siblings ...)
  2017-11-29 13:55 ` [PATCH v18 04/10] xbitmap: potential improvement Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-30 10:34   ` Tetsuo Handa
  2017-11-29 13:55 ` [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc Wei Wang
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

This patch adds support for finding the next set (1) or zero (0) bit in an
xbitmap range, and for clearing a range of bits.
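
For illustration, a sketch of how the two search operations can be combined
to walk contiguous ranges of set bits (the same pattern a later patch in
this series uses to build sgs); handle_range() is a hypothetical callback:

static void for_each_set_range(struct xb *xb, unsigned long start,
			       unsigned long end)
{
	unsigned long range_start, range_end;

	while (start <= end) {
		range_start = xb_find_next_set_bit(xb, start, end);
		if (range_start == end + 1)
			break;			/* no more set bits */
		range_end = xb_find_next_zero_bit(xb, range_start + 1, end);
		/* Bits [range_start, range_end - 1] are all set. */
		handle_range(range_start, range_end - 1);
		start = range_end + 1;
	}
}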

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits.
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add tags to help searching.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
---
 include/linux/xbitmap.h      |   8 +-
 lib/xbitmap.c                | 180 +++++++++++++++++++++++++++++++++++++++++++
 tools/include/linux/bitmap.h |  34 ++++++++
 tools/include/linux/kernel.h |   2 +
 4 files changed, 223 insertions(+), 1 deletion(-)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index b4d8375..eddf0d5e 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -33,8 +33,14 @@ static inline void xb_init(struct xb *xb)
 }
 
 int xb_set_bit(struct xb *xb, unsigned long bit);
+int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
-int xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
 
 static inline bool xb_empty(const struct xb *xb)
 {
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 182aa29..816dd3e 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -3,6 +3,13 @@
 #include <linux/bitmap.h>
 #include <linux/slab.h>
 
+/*
+ * Developer notes: locks are required to guarantee that there are no
+ * concurrent calls of xb_set_bit, xb_clear_bit, xb_clear_bit_range,
+ * xb_test_bit, xb_find_next_set_bit, or xb_find_next_zero_bit operating
+ * on the same ida bitmap.
+ */
+
 /**
  *  xb_set_bit - set a bit in the xbitmap
  *  @xb: the xbitmap tree used to record the bit
@@ -70,6 +77,28 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
 EXPORT_SYMBOL(xb_set_bit);
 
 /**
+ *  xb_preload_and_set_bit - preload the memory and set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * A wrapper of the xb_preload() and xb_set_bit().
+ * Returns: 0 on success; -EAGAIN or -ENOMEM on error.
+ */
+int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp)
+{
+	int ret = 0;
+
+	if (!xb_preload(gfp))
+		return -ENOMEM;
+
+	ret = xb_set_bit(xb, bit);
+	xb_preload_end();
+
+	return ret;
+}
+EXPORT_SYMBOL(xb_preload_and_set_bit);
+
+/**
  * xb_clear_bit - clear a bit in the xbitmap
  * @xb: the xbitmap tree used to record the bit
  * @bit: index of the bit to clear
@@ -115,6 +144,56 @@ void xb_clear_bit(struct xb *xb, unsigned long bit)
 EXPORT_SYMBOL(xb_clear_bit);
 
 /**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * This function is used to clear a range of bits in the xbitmap. If all the
+ * bits of an ida bitmap become 0, that bitmap will be freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+			    bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
  * xb_test_bit - test a bit in the xbitmap
  * @xb: the xbitmap tree used to record the bit
  * @bit: index of the bit to test
@@ -143,6 +222,80 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit)
 }
 EXPORT_SYMBOL(xb_test_bit);
 
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
+
 #ifndef __KERNEL__
 
 static DEFINE_XB(xb1);
@@ -160,6 +313,31 @@ void xbitmap_check_bit(unsigned long bit)
 	xb_preload_end();
 }
 
+static void xbitmap_check_bit_range(void)
+{
+	/* Set a range of bits */
+	assert(!xb_preload_and_set_bit(&xb1, 1060, GFP_KERNEL));
+	assert(!xb_preload_and_set_bit(&xb1, 1061, GFP_KERNEL));
+	assert(!xb_preload_and_set_bit(&xb1, 1064, GFP_KERNEL));
+	assert(!xb_preload_and_set_bit(&xb1, 1065, GFP_KERNEL));
+	assert(!xb_preload_and_set_bit(&xb1, 8180, GFP_KERNEL));
+	assert(!xb_preload_and_set_bit(&xb1, 8181, GFP_KERNEL));
+	assert(!xb_preload_and_set_bit(&xb1, 8190, GFP_KERNEL));
+	assert(!xb_preload_and_set_bit(&xb1, 8191, GFP_KERNEL));
+
+	/* Test a range of bits */
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 1060);
+	assert(xb_find_next_zero_bit(&xb1, 1061, 10000) == 1062);
+	assert(xb_find_next_set_bit(&xb1, 1062, 10000) == 1064);
+	assert(xb_find_next_zero_bit(&xb1, 1065, 10000) == 1066);
+	assert(xb_find_next_set_bit(&xb1, 1066, 10000) == 8180);
+	assert(xb_find_next_zero_bit(&xb1, 8180, 10000) == 8182);
+	xb_clear_bit_range(&xb1, 0, 20000);
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 10001);
+
+	assert(xb_find_next_zero_bit(&xb1, 20000, 30000) == 20000);
+}
+
 void xbitmap_checks(void)
 {
 	xb_init(&xb1);
@@ -171,6 +349,8 @@ void xbitmap_checks(void)
 	xbitmap_check_bit(1025);
 	xbitmap_check_bit((1UL << 63) | (1UL << 24));
 	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+	xbitmap_check_bit_range();
 }
 
 int __weak main(void)
diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index ca16027..8d0bc1b 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -37,6 +37,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
 	}
 }
 
+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+				  int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
+
+static inline __always_inline void bitmap_clear(unsigned long *map,
+						unsigned int start,
+						unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 {
 	unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 0ad8844..3c992ae 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -13,6 +13,8 @@
 #define UINT_MAX	(~0U)
 #endif
 
+#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 
 #define PERF_ALIGN(x, a)	__PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
-- 
2.7.4

* [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
                   ` (4 preceding siblings ...)
  2017-11-29 13:55 ` [PATCH v18 05/10] xbitmap: add more operations Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-30 19:38   ` Michael S. Tsirkin
  2017-11-29 13:55 ` [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

The current virtqueue_add API implementation is based on the scatterlist
struct, which uses kernel virtual addresses (kaddr). This is inadequate
for some vring use cases. For example:
- Some usages don't use an IOMMU; in this case the caller can directly
  pass in a physical address it already has, instead of going through the
  sg implementation (e.g. the VIRTIO_BALLOON_F_SG feature)
- Sometimes a guest physical page may not have a kaddr (e.g. high
  memory) but still needs to use the vring (e.g. the
  VIRTIO_BALLOON_F_FREE_PAGE_VQ feature)

The new API, virtqueue_add_one_desc, enables the caller to fill a vring
desc directly with a physical address and length. Also, factor out the
code common with virtqueue_add into vring_set_avail.
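
As a usage sketch of the new API (the signature is the one added by this
patch; the surrounding driver code is hypothetical):

static int post_one_inbuf(struct virtqueue *vq, uint64_t addr, uint32_t len,
			  void *token)
{
	int err;

	/* 'true' marks the desc as device-writable (an in-buffer). */
	err = virtqueue_add_one_desc(vq, addr, len, true, token);
	if (err)
		return err;	/* e.g. -ENOSPC when no free entry is left */

	virtqueue_kick(vq);
	return 0;
}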

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_ring.c | 94 +++++++++++++++++++++++++++++++++++---------
 include/linux/virtio.h       |  6 +++
 2 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eb30f3e..0b87123 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -257,6 +257,79 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 	return desc;
 }
 
+static void vring_set_avail(struct virtqueue *_vq,
+			    unsigned int i)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	unsigned int avail;
+
+	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
+	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, i);
+
+	/*
+	 * Descriptors and available array need to be set before we expose the
+	 * new available array entries.
+	 */
+	virtio_wmb(vq->weak_barriers);
+	vq->avail_idx_shadow++;
+	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev,
+					       vq->avail_idx_shadow);
+	vq->num_added++;
+
+	pr_debug("Added buffer head %i to %p\n", i, vq);
+
+	/*
+	 * This is very unlikely, but theoretically possible.  Kick
+	 * just in case.
+	 */
+	if (unlikely(vq->num_added == (1 << 16) - 1))
+		virtqueue_kick(_vq);
+}
+
+int virtqueue_add_one_desc(struct virtqueue *_vq,
+			   uint64_t addr,
+			   uint32_t len,
+			   bool in_desc,
+			   void *data)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_desc *desc;
+	unsigned int i;
+
+	START_USE(vq);
+	BUG_ON(data == NULL);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+	if (_vq->num_free < 1) {
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
+	i = vq->free_head;
+	desc = &vq->vring.desc[i];
+	desc->addr = cpu_to_virtio64(_vq->vdev, addr);
+	desc->len = cpu_to_virtio32(_vq->vdev, len);
+	if (in_desc)
+		desc->flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_WRITE);
+	else
+		desc->flags = 0;
+	vq->desc_state[i].data = data;
+	vq->desc_state[i].indir_desc = NULL;
+	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
+	_vq->num_free--;
+
+	vring_set_avail(_vq, i);
+
+	END_USE(vq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_one_desc);
+
 static inline int virtqueue_add(struct virtqueue *_vq,
 				struct scatterlist *sgs[],
 				unsigned int total_sg,
@@ -269,7 +342,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
 	struct vring_desc *desc;
-	unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx;
+	unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
 	int head;
 	bool indirect;
 
@@ -395,26 +468,9 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	else
 		vq->desc_state[head].indir_desc = ctx;
 
-	/* Put entry in available array (but don't update avail->idx until they
-	 * do sync). */
-	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
-	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
-
-	/* Descriptors and available array need to be set before we expose the
-	 * new available array entries. */
-	virtio_wmb(vq->weak_barriers);
-	vq->avail_idx_shadow++;
-	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
-	vq->num_added++;
-
-	pr_debug("Added buffer head %i to %p\n", head, vq);
+	vring_set_avail(_vq, head);
 	END_USE(vq);
 
-	/* This is very unlikely, but theoretically possible.  Kick
-	 * just in case. */
-	if (unlikely(vq->num_added == (1 << 16) - 1))
-		virtqueue_kick(_vq);
-
 	return 0;
 
 unmap_release:
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 988c735..1d89996 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -35,6 +35,12 @@ struct virtqueue {
 	void *priv;
 };
 
+int virtqueue_add_one_desc(struct virtqueue *_vq,
+			   uint64_t addr,
+			   uint32_t len,
+			   bool in_desc,
+			   void *data);
+
 int virtqueue_add_outbuf(struct virtqueue *vq,
 			 struct scatterlist sg[], unsigned int num,
 			 void *data,
-- 
2.7.4

* [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
                   ` (5 preceding siblings ...)
  2017-11-29 13:55 ` [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-30 10:35   ` Tetsuo Handa
  2017-12-01 15:38   ` Michael S. Tsirkin
  2017-11-29 13:55 ` [PATCH v18 08/10] mm: support reporting free page blocks Wei Wang
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
host. A scatter-gather list is described by a vring desc.

The previous virtio-balloon implementation is not very efficient, because
the balloon pages are transferred to the host one PFN array at a time.
Here is the percentage breakdown of the time spent on each step of the
balloon inflating process (inflating 7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are steps 2) and 4).

This patch optimizes step 2) by transferring pages to the host in sgs. An
sg describes a chunk of physically contiguous guest pages. With this
mechanism, step 4) can also be optimized by doing address translation and
madvise() in chunks rather than page by page.
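
For context only (the host side is outside the scope of this series), a
rough sketch of what the device implementation, e.g. in QEMU, could do with
each (addr, len) chunk it receives; gpa_to_hva() is a hypothetical
guest-physical-to-host-virtual translation helper:

/* Host-side (userspace) sketch; assumes <sys/mman.h>. */
static void handle_inflate_chunk(uint64_t gpa, uint32_t len)
{
	void *hva = gpa_to_hva(gpa);		/* one translation per chunk */

	/* One madvise() covers the whole physically contiguous chunk. */
	madvise(hva, len, MADV_DONTNEED);
}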

With this new feature, the above ballooning process takes ~440ms resulting
in an improvement of ~89%.

TODO: optimize step 1) by allocating/freeing a chunk of pages instead of
a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/virtio/virtio_balloon.c     | 230 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 212 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7960746..2c21c5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,121 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void send_one_desc(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  uint64_t addr,
+			  uint32_t len,
+			  bool inbuf,
+			  bool batch)
+{
+	int err;
+	unsigned int size;
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &size))
+		;
+
+	err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq);
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the desc will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
+
+	/* If batching is requested, we batch till the vq is full */
+	if (!batch || !vq->num_free)
+		kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for contiguous "1" bits, which correspond
+ * to physically contiguous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long pfn_start, pfn_end;
+	uint64_t addr;
+	uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	pfn_start = page_xb_start;
+	while (pfn_start < page_xb_end) {
+		pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
+						 page_xb_end);
+		if (pfn_start == page_xb_end + 1)
+			break;
+		pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+						pfn_start + 1,
+						page_xb_end);
+		addr = pfn_start << PAGE_SHIFT;
+		len = (pfn_end - pfn_start) << PAGE_SHIFT;
+		while (len > max_len) {
+			send_one_desc(vb, vq, addr, max_len, true, true);
+			addr += max_len;
+			len -= max_len;
+		}
+		send_one_desc(vb, vq, addr, len, true, true);
+		pfn_start = pfn_end + 1;
+	}
+
+	/*
+	 * The last few desc-s may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
+static inline int xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int ret;
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+
+	do {
+		ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
+					     GFP_NOWAIT | __GFP_NOWARN);
+	} while (unlikely(ret == -EAGAIN));
+
+	return ret;
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	unsigned num_allocated_pages;
 	unsigned num_pfns;
 	struct page *page;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (num_pfns = 0; num_pfns < num;
 	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
@@ -172,11 +283,18 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	vb->num_pfns = 0;
 
 	while ((page = balloon_page_pop(&pages))) {
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				__free_page(page);
+				break;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
+
 		balloon_page_enqueue(&vb->vb_dev_info, page);
 
 		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
-
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -185,8 +303,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -212,9 +334,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -224,7 +349,14 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+				balloon_page_enqueue(&vb->vb_dev_info, page);
+				break;
+			}
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -235,13 +367,55 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
 
+/*
+ * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
+ * for xbitmap, which is not suitable for the oom case. This function does not
+ * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
+ * pages when VIRTIO_BALLOON_F_SG is negotiated.
+ */
+static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
+{
+	unsigned int n;
+	struct page *page;
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	struct virtqueue *vq = vb->deflate_vq;
+	LIST_HEAD(pages);
+
+	mutex_lock(&vb->balloon_lock);
+	for (n = 0; n < oom_pages; n++) {
+		page = balloon_page_dequeue(vb_dev_info);
+		if (!page)
+			break;
+
+		list_add(&page->lru, &pages);
+		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+		send_one_desc(vb, vq, virt_to_phys(page_address(page)),
+			      PAGE_SIZE, true, true);
+		release_pages_balloon(vb, &pages);
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+	mutex_unlock(&vb->balloon_lock);
+
+	return n;
+}
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
@@ -381,7 +555,10 @@ static int virtballoon_oom_notify(struct notifier_block *self,
 		return NOTIFY_OK;
 
 	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
+		num_freed_pages = leak_balloon_sg_oom(vb);
+	else
+		num_freed_pages = leak_balloon(vb, oom_pages);
 	update_balloon_size(vb);
 	*freed += num_freed_pages;
 
@@ -478,6 +655,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -499,16 +677,26 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_one_desc(vb, vb->inflate_vq,
+			      virt_to_phys(page_address(newpage)),
+			      PAGE_SIZE, true, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_one_desc(vb, vb->inflate_vq,
+			      virt_to_phys(page_address(page)),
+			      PAGE_SIZE, true, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -567,6 +755,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -683,6 +874,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v18 08/10] mm: support reporting free page blocks
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
                   ` (6 preceding siblings ...)
  2017-11-29 13:55 ` [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-29 13:55 ` [PATCH v18 09/10] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
  2017-11-29 13:55 ` [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled Wei Wang
  9 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One example use of this patch is to accelerate live migration by skipping
the transfer of free pages reported by the guest. A method commonly used
by the hypervisor to track which parts of memory are written during live
migration is to write-protect all the guest memory. So pages that are
reported as free but are written to after the report function returns will
be caught by the hypervisor and added to the next round of memory transfer.
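
For example (an illustrative sketch only, not part of this patch; struct
free_range_recorder is hypothetical), a caller could collect the reported
ranges with a callback along the following lines. The virtio-balloon driver
in a later patch of this series uses the same pattern:

  /* The callback runs under zone->lock, so it must not sleep or allocate. */
  struct free_range_recorder {
          unsigned long pfn[16];
          unsigned long num[16];
          unsigned int count;
  };

  static bool record_free_range(void *opaque, unsigned long pfn,
                                unsigned long num)
  {
          struct free_range_recorder *rec = opaque;

          if (rec->count >= ARRAY_SIZE(rec->pfn))
                  return false;           /* ask the walk to stop */
          rec->pfn[rec->count] = pfn;
          rec->num[rec->count] = num;
          rec->count++;
          return true;                    /* continue reporting */
  }

  static void record_free_ranges(struct free_range_recorder *rec)
  {
          rec->count = 0;
          /* Low orders are volatile, so only query MAX_ORDER - 1 blocks. */
          walk_free_mem_block(rec, MAX_ORDER - 1, record_free_range);
  }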

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
---
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee07314..c1339be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1924,6 +1924,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque,
+				int min_order,
+				bool (*report_pfn_range)(void *opaque,
+							 unsigned long pfn,
+							 unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4096f4..0f4a197 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4892,6 +4892,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+				struct zone *zone,
+				int order,
+				enum migratetype mt,
+				bool (*report_pfn_range)(void *,
+							 unsigned long,
+							 unsigned long))
+{
+	struct page *page;
+	struct list_head *list;
+	unsigned long pfn, flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list = &zone->free_area[order].free_list[mt];
+	list_for_each_entry(page, list, lru) {
+		pfn = page_to_pfn(page);
+		ret = report_pfn_range(opaque, pfn, 1 << order);
+		if (!ret)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+			 int min_order,
+			 bool (*report_pfn_range)(void *opaque,
+						  unsigned long pfn,
+						  unsigned long num))
+{
+	struct zone *zone;
+	int order;
+	enum migratetype mt;
+	bool ret;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				ret = walk_free_page_list(opaque, zone,
+							  order, mt,
+							  report_pfn_range);
+				if (!ret)
+					return;
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v18 09/10] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
                   ` (7 preceding siblings ...)
  2017-11-29 13:55 ` [PATCH v18 08/10] mm: support reporting free page blocks Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-11-29 13:55 ` [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled Wei Wang
  9 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates support
for reporting hints of guest free pages to the host via virtio-balloon.

The host requests the guest to report free pages by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register.

When the guest starts to report, the first element added to the free page
vq is the cmd id given by the host. When the guest finishes reporting all
the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added to the vq
to tell the host that the reporting is done. The host may also request the
guest to stop reporting in advance by sending the stop cmd id to the guest
via the configuration register.
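
In outline, the guest-side flow implemented below is the following
(simplified sketch; send_cmd_id() is a placeholder for adding a cmd id
to the free page vq, and the batching/kick details are omitted):

  static void report_free_page_outline(struct virtio_balloon *vb)
  {
          /* 1. Ack the start request with the cmd id read from the config. */
          send_cmd_id(vb, vb->start_cmd_id);

          /*
           * 2. Report free page blocks. The callback stops reporting as
           *    soon as the host writes STOP_ID to the config register.
           */
          walk_free_mem_block(vb, 0, virtio_balloon_send_free_pages);

          /* 3. Tell the host that this round of reporting is done. */
          send_cmd_id(vb, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID);
  }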

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 202 +++++++++++++++++++++++++++++-------
 include/uapi/linux/virtio_balloon.h |   4 +
 2 files changed, 167 insertions(+), 39 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 2c21c5a..035bd3a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The free page reporting work item submitted to the balloon wq */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +70,13 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Start to report free pages */
+	bool report_free_page;
+	/* Stores the cmd id given by host to start the free page reporting */
+	uint32_t start_cmd_id;
+	/* Stores STOP_ID as a sign to tell host that the reporting is done */
+	uint32_t stop_cmd_id;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -159,7 +171,8 @@ static void send_one_desc(struct virtio_balloon *vb,
 			  uint64_t addr,
 			  uint32_t len,
 			  bool inbuf,
-			  bool batch)
+			  bool batch,
+			  bool wait)
 {
 	int err;
 	unsigned int size;
@@ -178,8 +191,12 @@ static void send_one_desc(struct virtio_balloon *vb,
 	BUG_ON(err);
 
 	/* If batching is requested, we batch till the vq is full */
-	if (!batch || !vq->num_free)
-		kick_and_wait(vq, vb->acked);
+	if (!batch || !vq->num_free) {
+		if (wait)
+			kick_and_wait(vq, vb->acked);
+		else
+			virtqueue_kick(vq);
+	}
 }
 
 /*
@@ -212,11 +229,11 @@ static void tell_host_sgs(struct virtio_balloon *vb,
 		addr = pfn_start << PAGE_SHIFT;
 		len = (pfn_end - pfn_start) << PAGE_SHIFT;
 		while (len > max_len) {
-			send_one_desc(vb, vq, addr, max_len, true, true);
+			send_one_desc(vb, vq, addr, max_len, true, true, true);
 			addr += max_len;
 			len -= max_len;
 		}
-		send_one_desc(vb, vq, addr, len, true, true);
+		send_one_desc(vb, vq, addr, len, true, true, true);
 		pfn_start = pfn_end + 1;
 	}
 
@@ -401,7 +418,7 @@ static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 		send_one_desc(vb, vq, virt_to_phys(page_address(page)),
-			      PAGE_SIZE, true, true);
+			      PAGE_SIZE, true, true, true);
 		release_pages_balloon(vb, &pages);
 	}
 
@@ -491,17 +508,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-	struct virtio_balloon *vb = vdev->priv;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vb->stop_update_lock, flags);
-	if (!vb->stop_update)
-		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	s64 target;
@@ -518,6 +524,36 @@ static inline s64 towards_target(struct virtio_balloon *vb)
 	return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+	unsigned long flags;
+	__u32 cmd_id;
+	s64 diff = towards_target(vb);
+
+	if (diff) {
+		spin_lock_irqsave(&vb->stop_update_lock, flags);
+		if (!vb->stop_update)
+			queue_work(system_freezable_wq,
+				   &vb->update_balloon_size_work);
+		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+	}
+
+	virtio_cread(vb->vdev, struct virtio_balloon_config,
+		     free_page_report_cmd_id, &cmd_id);
+	if (cmd_id == VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) {
+		WRITE_ONCE(vb->report_free_page, false);
+	} else if (cmd_id != vb->start_cmd_id) {
+		/*
+		 * Host requests to start the reporting by sending a new cmd
+		 * id.
+		 */
+		WRITE_ONCE(vb->report_free_page, true);
+		vb->start_cmd_id = cmd_id;
+		queue_work(vb->balloon_wq, &vb->report_free_page_work);
+	}
+}
+
 static void update_balloon_size(struct virtio_balloon *vb)
 {
 	u32 actual = vb->num_pages;
@@ -593,42 +629,121 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	uint64_t addr = pfn << PAGE_SHIFT;
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (!READ_ONCE(vb->report_free_page))
+		return false;
 
+	send_one_desc(vb, vb->free_page_vq, addr, len, true, true, false);
+
+	return true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	/* Start by sending the obtained cmd id to the host with an outbuf */
+	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
+		      sizeof(uint32_t), false, true, false);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * End by sending the stop id to the host with an outbuf. Use the
+	 * non-batching mode here to trigger a kick after adding the stop id.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->stop_cmd_id),
+		      sizeof(uint32_t), false, false, false);
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		vb->free_page_vq = vqs[i];
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -680,7 +795,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	if (use_sg) {
 		send_one_desc(vb, vb->inflate_vq,
 			      virt_to_phys(page_address(newpage)),
-			      PAGE_SIZE, true, false);
+			      PAGE_SIZE, true, false, true);
 	} else {
 		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
 		set_page_pfns(vb, vb->pfns, newpage);
@@ -691,7 +806,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	if (use_sg) {
 		send_one_desc(vb, vb->inflate_vq,
 			      virt_to_phys(page_address(page)),
-			      PAGE_SIZE, true, false);
+			      PAGE_SIZE, true, false, true);
 	} else {
 		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
 		set_page_pfns(vb, vb->pfns, page);
@@ -758,6 +873,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->stop_cmd_id = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -822,6 +944,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -875,6 +998,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..58f1274 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,15 +35,19 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* VQ to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+#define VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID		0
 struct virtio_balloon_config {
 	/* Number of pages host wants Guest to give up. */
 	__u32 num_pages;
 	/* Number of pages we've actually got in balloon. */
 	__u32 actual;
+	/* Free page report command id, readonly by guest */
+	__u32 free_page_report_cmd_id;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled
  2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
                   ` (8 preceding siblings ...)
  2017-11-29 13:55 ` [PATCH v18 09/10] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
@ 2017-11-29 13:55 ` Wei Wang
  2017-12-01 15:49   ` Michael S. Tsirkin
  9 siblings, 1 reply; 37+ messages in thread
From: Wei Wang @ 2017-11-29 13:55 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

The guest free pages should not be discarded by the live migration thread
when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
skipping the transfer of such poisoned free pages would trigger false
positives when new pages are allocated and checked on the destination.
This patch skips the reporting of free pages in that case.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
---
 drivers/virtio/virtio_balloon.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 035bd3a..6ac4cff 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work)
 	/* Start by sending the obtained cmd id to the host with an outbuf */
 	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
 		      sizeof(uint32_t), false, true, false);
-	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	if (!(page_poisoning_enabled() &&
+	    !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
+		walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
 	/*
 	 * End by sending the stop id to the host with an outbuf. Use the
 	 * non-batching mode here to trigger a kick after adding the stop id.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 01/10] idr: add #include <linux/bug.h>
  2017-11-29 13:55 ` [PATCH v18 01/10] idr: add #include <linux/bug.h> Wei Wang
@ 2017-11-30  0:58   ` Matthew Wilcox
  2017-11-30  7:07     ` Michal Hocko
  2017-11-30 21:49     ` Andrew Morton
  0 siblings, 2 replies; 37+ messages in thread
From: Matthew Wilcox @ 2017-11-30  0:58 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel,
	Masahiro Yamada

On Wed, Nov 29, 2017 at 09:55:17PM +0800, Wei Wang wrote:
> The <linux/bug.h> was removed from radix-tree.h by the following commit:
> f5bba9d11a256ad2a1c2f8e7fc6aabe6416b7890.
> 
> Since that commit, tools/testing/radix-tree/ couldn't pass compilation
> due to: tools/testing/radix-tree/idr.c:17: undefined reference to
> WARN_ON_ONCE. This patch adds the bug.h header to idr.h to solve the
> issue.

Thanks; I sent this same patch out yesterday.

Unfortunately, you didn't cc the author of this breakage, Masahiro Yamada.
I want to highlight that these kinds of header cleanups are risky,
and very low reward.  I really don't want to see patches going all over
the tree randomly touching header files.  If we've got a real problem
to solve, then sure.  But I want to see a strong justification for any
more header file cleanups.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 01/10] idr: add #include <linux/bug.h>
  2017-11-30  0:58   ` Matthew Wilcox
@ 2017-11-30  7:07     ` Michal Hocko
  2017-11-30 21:49     ` Andrew Morton
  1 sibling, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2017-11-30  7:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel,
	Masahiro Yamada

On Wed 29-11-17 16:58:17, Matthew Wilcox wrote:
> On Wed, Nov 29, 2017 at 09:55:17PM +0800, Wei Wang wrote:
> > The <linux/bug.h> was removed from radix-tree.h by the following commit:
> > f5bba9d11a256ad2a1c2f8e7fc6aabe6416b7890.
> > 
> > Since that commit, tools/testing/radix-tree/ couldn't pass compilation
> > due to: tools/testing/radix-tree/idr.c:17: undefined reference to
> > WARN_ON_ONCE. This patch adds the bug.h header to idr.h to solve the
> > issue.
> 
> Thanks; I sent this same patch out yesterday.
> 
> Unfortunately, you didn't cc the author of this breakage, Masahiro Yamada.
> I want to highlight that these kinds of header cleanups are risky,
> and very low reward.  I really don't want to see patches going all over
> the tree randomly touching header files.  If we've got a real problem
> to solve, then sure.  But I want to see a strong justification for any
> more header file cleanups.

I agree. It usually requires an unexpected combination of config options to
uncover some nasty include dependencies. So these patches might break the
build while their additional value is quite questionable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-11-29 13:55 ` [PATCH v18 05/10] xbitmap: add more operations Wei Wang
@ 2017-11-30 10:34   ` Tetsuo Handa
  2017-11-30 13:35     ` Tetsuo Handa
  2017-12-01  8:02     ` Wei Wang
  0 siblings, 2 replies; 37+ messages in thread
From: Tetsuo Handa @ 2017-11-30 10:34 UTC (permalink / raw)
  To: wei.w.wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

Wei Wang wrote:
>  /**
> + * xb_clear_bit - clear a range of bits in the xbitmap

Name mismatch.

> + * @start: the start of the bit range, inclusive
> + * @end: the end of the bit range, inclusive
> + *
> + * This function is used to clear a bit in the xbitmap. If all the bits of the
> + * bitmap are 0, the bitmap will be freed.
> + */
> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
> +{
> +	struct radix_tree_root *root = &xb->xbrt;
> +	struct radix_tree_node *node;
> +	void **slot;
> +	struct ida_bitmap *bitmap;
> +	unsigned int nbits;
> +
> +	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> +		unsigned long index = start / IDA_BITMAP_BITS;
> +		unsigned long bit = start % IDA_BITMAP_BITS;
> +
> +		bitmap = __radix_tree_lookup(root, index, &node, &slot);
> +		if (radix_tree_exception(bitmap)) {
> +			unsigned long ebit = bit + 2;
> +			unsigned long tmp = (unsigned long)bitmap;
> +
> +			nbits = min(end - start + 1, BITS_PER_LONG - ebit);

"nbits = min(end - start + 1," seems to expect that start == end is legal
for clearing only 1 bit. But this function is a no-op if start == end.
Please clarify what "inclusive" is intended to mean.

> +
> +			if (ebit >= BITS_PER_LONG)
> +				continue;

(I don't understand how the radix tree works, but generally this patchset looks fuzzy
to me about boundary cases. Thus, I want to confirm that this is not an oversight.)
Why is making the "ebit >= BITS_PER_LONG" (e.g. start == 62) case a no-op correct?
Aren't there bits which should have been cleared in this case?

> +			bitmap_clear(&tmp, ebit, nbits);
> +			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> +				__radix_tree_delete(root, node, slot);
> +			else
> +				rcu_assign_pointer(*slot, (void *)tmp);
> +		} else if (bitmap) {
> +			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
> +
> +			if (nbits != IDA_BITMAP_BITS)
> +				bitmap_clear(bitmap->bitmap, bit, nbits);
> +
> +			if (nbits == IDA_BITMAP_BITS ||
> +			    bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
> +				kfree(bitmap);
> +				__radix_tree_delete(root, node, slot);
> +			}
> +		}
> +	}
> +}



> +static inline __always_inline void bitmap_clear(unsigned long *map,
> +						unsigned int start,
> +						unsigned int nbits)
> +{
> +	if (__builtin_constant_p(nbits) && nbits == 1)
> +		__clear_bit(start, map);
> +	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
> +		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))

It looks strange to apply __builtin_constant_p test to variables after "& 7".

> +		memset((char *)map + start / 8, 0, nbits / 8);
> +	else
> +		__bitmap_clear(map, start, nbits);
> +}
> +

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-29 13:55 ` [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-11-30 10:35   ` Tetsuo Handa
  2017-11-30 16:25     ` Wang, Wei W
  2017-12-01 15:38   ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Tetsuo Handa @ 2017-11-30 10:35 UTC (permalink / raw)
  To: wei.w.wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

Wei Wang wrote:
> +static inline int xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	int ret;
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +
> +	do {
> +		ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
> +					     GFP_NOWAIT | __GFP_NOWARN);

It is a bit of pity that __GFP_NOWARN here is applied to only xb_preload().
Memory allocation by xb_set_bit() will after all emit warnings. Maybe

  xb_init(&vb->page_xb);
  vb->page_xb.gfp_mask |= __GFP_NOWARN;

is tolerable? Or, unconditionally apply __GFP_NOWARN at xb_init()?

  static inline void xb_init(struct xb *xb)
  {
          INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
  }

> +	} while (unlikely(ret == -EAGAIN));
> +
> +	return ret;
> +}
> +



> @@ -172,11 +283,18 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  	vb->num_pfns = 0;
>  
>  	while ((page = balloon_page_pop(&pages))) {
> +		if (use_sg) {
> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> +				__free_page(page);
> +				break;

You cannot "break;" without consuming all pages in "pages".

> +			}
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}
> +
>  		balloon_page_enqueue(&vb->vb_dev_info, page);
>  
>  		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
> -
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))



> @@ -212,9 +334,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);

You can pass use_sg as an argument to leak_balloon(). Then, you won't
need to define leak_balloon_sg_oom(). Since the xbitmap allocation does not
use __GFP_DIRECT_RECLAIM, it is safe to reuse leak_balloon() for the OOM
path. Just be sure to pass use_sg == false, because memory allocation for
use_sg == true will likely fail when called from the OOM path. (But trying
use_sg == true for the OOM path and then falling back to use_sg == false
would not be bad?)
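
That is, something like this (untested sketch, reusing the names from this
patch):

  /* new prototype: the caller decides whether the sg path may be used */
  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num,
                               bool use_sg);

  /* normal deflate path, e.g. in update_balloon_size_func() */
  leak_balloon(vb, num, virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG));

  /* OOM path in virtballoon_oom_notify(): never rely on xbitmap allocation */
  num_freed_pages = leak_balloon(vb, oom_pages, false);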

> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */



> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */

Want more explicit comment that PFN lists will be used on OOM and therefore
the host side must be prepared for both sg and PFN lists even if negotiated?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-11-30 10:34   ` Tetsuo Handa
@ 2017-11-30 13:35     ` Tetsuo Handa
  2017-11-30 14:39       ` Matthew Wilcox
  2017-12-01  8:02     ` Wei Wang
  1 sibling, 1 reply; 37+ messages in thread
From: Tetsuo Handa @ 2017-11-30 13:35 UTC (permalink / raw)
  To: wei.w.wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

Tetsuo Handa wrote:
> > +
> > +			if (ebit >= BITS_PER_LONG)
> > +				continue;
> 
> (I don't understand how radix tree works, but generally this patchset looks fuzzy
> to me about boundary cases. Thus, I want to confirm that this is not an overlook.)
> Why is making "ebit >= BITS_PER_LONG" (e.g. start == 62) case a no-op correct?
> Aren't there bits which should have been cleared in this case?

According to xb_set_bit(), it seems to me that we are trying to avoid memory allocation
for "struct ida_bitmap" when all set bits within a 1024-bit bitmap reside in the first
61 bits.

But does such a saving help? Is there a characteristic bias such that the majority of
set bits reside in the first 61 bits, given that "bit" is an "unsigned long" which holds
a page number (isn't it)? If there is no such bias, wouldn't eliminating the
radix_tree_exception() case and always storing "struct ida_bitmap" simplify the code
(and make the processing faster)?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-11-30 13:35     ` Tetsuo Handa
@ 2017-11-30 14:39       ` Matthew Wilcox
  2017-12-03  1:44         ` Tetsuo Handa
  0 siblings, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2017-11-30 14:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: wei.w.wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

On Thu, Nov 30, 2017 at 10:35:03PM +0900, Tetsuo Handa wrote:
> According to xb_set_bit(), it seems to me that we are trying to avoid memory allocation
> for "struct ida_bitmap" when all set bits within a 1024-bit bitmap reside in the first
> 61 bits.
> 
> But does such a saving help? Is there a characteristic bias such that the majority of
> set bits reside in the first 61 bits, given that "bit" is an "unsigned long" which holds
> a page number (isn't it)? If there is no such bias, wouldn't eliminating the
> radix_tree_exception() case and always storing "struct ida_bitmap" simplify the code
> (and make the processing faster)?

It happens all the time.  The vast majority of users of the IDA set
low bits.  Also, it's the first 62 bits -- going up to 63 bits with the
XArray rewrite.

I do plan to redo the xbitmap on top of the XArray; I'm just trying to
get the XArray merged first.  The IDA and xbitmap code will share much
more code when that happens.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-30 10:35   ` Tetsuo Handa
@ 2017-11-30 16:25     ` Wang, Wei W
  0 siblings, 0 replies; 37+ messages in thread
From: Wang, Wei W @ 2017-11-30 16:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Thursday, November 30, 2017 6:36 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > +static inline int xb_set_page(struct virtio_balloon *vb,
> > +			       struct page *page,
> > +			       unsigned long *pfn_min,
> > +			       unsigned long *pfn_max)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	int ret;
> > +
> > +	*pfn_min = min(pfn, *pfn_min);
> > +	*pfn_max = max(pfn, *pfn_max);
> > +
> > +	do {
> > +		ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
> > +					     GFP_NOWAIT | __GFP_NOWARN);
> 
> It is a bit of pity that __GFP_NOWARN here is applied to only xb_preload().
> Memory allocation by xb_set_bit() will after all emit warnings. Maybe
> 
>   xb_init(&vb->page_xb);
>   vb->page_xb.gfp_mask |= __GFP_NOWARN;
> 
> is tolerable? Or, unconditionally apply __GFP_NOWARN at xb_init()?
> 



Please have a look at this one: radix_tree_node_alloc()

In our case, I think the code path goes to 

if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
...
ret = kmem_cache_alloc(radix_tree_node_cachep,
                                       gfp_mask | __GFP_NOWARN);
...
goto out;
}

So I think the __GFP_NOWARN is already there.



>   static inline void xb_init(struct xb *xb)
>   {
>           INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
>   }
> 
> > +	} while (unlikely(ret == -EAGAIN));
> > +
> > +	return ret;
> > +}
> > +
> 
> 
> 
> > @@ -172,11 +283,18 @@ static unsigned fill_balloon(struct virtio_balloon
> *vb, size_t num)
> >  	vb->num_pfns = 0;
> >
> >  	while ((page = balloon_page_pop(&pages))) {
> > +		if (use_sg) {
> > +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> > +				__free_page(page);
> > +				break;
> 
> You cannot "break;" without consuming all pages in "pages".


Right, I think it should be "continue" here. Thanks. 
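
For reference, the loop would then look roughly like this (untested, against
this v18 code):

        while ((page = balloon_page_pop(&pages))) {
                if (use_sg) {
                        if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
                                __free_page(page);
                                continue;       /* keep draining "pages" */
                        }
                } else {
                        set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
                }

                balloon_page_enqueue(&vb->vb_dev_info, page);

                vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
                vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
                if (!virtio_has_feature(vb->vdev,
                                        VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
                        adjust_managed_page_count(page, -1);
        }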

> 
> > +			}
> > +		} else {
> > +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > +		}
> > +
> >  		balloon_page_enqueue(&vb->vb_dev_info, page);
> >
> >  		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > -
> > -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> >  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> >  		if (!virtio_has_feature(vb->vdev,
> >
> 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> 
> 
> 
> > @@ -212,9 +334,12 @@ static unsigned leak_balloon(struct virtio_balloon
> *vb, size_t num)
> >  	struct page *page;
> >  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> >  	LIST_HEAD(pages);
> > +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> 
> You can pass use_sg as an argument to leak_balloon(). Then, you won't need
> to define leak_balloon_sg_oom(). Since xbitmap allocation does not use
> __GFP_DIRECT_RECLAIM, it is safe to reuse leak_balloon() for OOM path.
> Just be sure to pass use_sg == false because memory allocation for use_sg ==
> true likely fails when called from OOM path. (But trying use_sg == true for
> OOM path and then fallback to use_sg == false is not bad?)
> 


But once the SG mechanism is in use, we cannot use the non-SG mechanism, because the interface between the guest and host is not the same for SG and non-SG. Making the host support both mechanisms at the same time would complicate the interface and implementation.

I also think the old non-SG mechanism should be deprecated, since its implementation isn't perfect in some respects. For example, it balloons pages using an outbuf, which implies that no changes are allowed to the balloon pages, and this isn't true for ballooning. The new SG mechanism has changed it to use an inbuf.

So I think using leak_balloon_sg_oom() would be better. Is there any reason that we couldn't use it?

Best,
Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc
  2017-11-29 13:55 ` [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc Wei Wang
@ 2017-11-30 19:38   ` Michael S. Tsirkin
  2017-12-01  8:06     ` Wei Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2017-11-30 19:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Wed, Nov 29, 2017 at 09:55:22PM +0800, Wei Wang wrote:
> Current virtqueue_add API implementation is based on the scatterlist
> struct, which uses kaddr. This is inadequate to all the use case of
> vring. For example:
> - Some usages don't use IOMMU, in this case the user can directly pass
>   in a physical address in hand, instead of going through the sg
>   implementation (e.g. the VIRTIO_BALLOON_F_SG feature)
> - Sometimes, a guest physical page may not have a kaddr (e.g. high
>   memory) but need to use vring (e.g. the VIRTIO_BALLOON_F_FREE_PAGE_VQ
>   feature)
> 
> The new API virtqueue_add_one_desc enables the caller to assign a vring
> desc with a physical address and len. Also, factor out the common code
> with virtqueue_add in vring_set_avail.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>

You previously managed without this patch, and it's preferable
IMHO since this patchset is already too big.

I don't really understand what is wrong with virtio_add_sgs + sg_set_page.
I don't think it assumes a kaddr.

> ---
>  drivers/virtio/virtio_ring.c | 94 +++++++++++++++++++++++++++++++++++---------
>  include/linux/virtio.h       |  6 +++
>  2 files changed, 81 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index eb30f3e..0b87123 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -257,6 +257,79 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
>  	return desc;
>  }
>  
> +static void vring_set_avail(struct virtqueue *_vq,
> +			    unsigned int i)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	unsigned int avail;
> +
> +	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
> +	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, i);
> +
> +	/*
> +	 * Descriptors and available array need to be set before we expose the
> +	 * new available array entries.
> +	 */
> +	virtio_wmb(vq->weak_barriers);
> +	vq->avail_idx_shadow++;
> +	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> +					       vq->avail_idx_shadow);
> +	vq->num_added++;
> +
> +	pr_debug("Added buffer head %i to %p\n", i, vq);
> +
> +	/*
> +	 * This is very unlikely, but theoretically possible.  Kick
> +	 * just in case.
> +	 */
> +	if (unlikely(vq->num_added == (1 << 16) - 1))
> +		virtqueue_kick(_vq);
> +}
> +
> +int virtqueue_add_one_desc(struct virtqueue *_vq,
> +			   uint64_t addr,
> +			   uint32_t len,
> +			   bool in_desc,
> +			   void *data)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	struct vring_desc *desc;
> +	unsigned int i;
> +
> +	START_USE(vq);
> +	BUG_ON(data == NULL);
> +
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	if (_vq->num_free < 1) {
> +		END_USE(vq);
> +		return -ENOSPC;
> +	}
> +
> +	i = vq->free_head;
> +	desc = &vq->vring.desc[i];
> +	desc->addr = cpu_to_virtio64(_vq->vdev, addr);
> +	desc->len = cpu_to_virtio32(_vq->vdev, len);
> +	if (in_desc)
> +		desc->flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_WRITE);
> +	else
> +		desc->flags = 0;
> +	vq->desc_state[i].data = data;
> +	vq->desc_state[i].indir_desc = NULL;
> +	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
> +	_vq->num_free--;
> +
> +	vring_set_avail(_vq, i);
> +
> +	END_USE(vq);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_one_desc);
> +
>  static inline int virtqueue_add(struct virtqueue *_vq,
>  				struct scatterlist *sgs[],
>  				unsigned int total_sg,
> @@ -269,7 +342,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>  	struct scatterlist *sg;
>  	struct vring_desc *desc;
> -	unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx;
> +	unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
>  	int head;
>  	bool indirect;
>  
> @@ -395,26 +468,9 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	else
>  		vq->desc_state[head].indir_desc = ctx;
>  
> -	/* Put entry in available array (but don't update avail->idx until they
> -	 * do sync). */
> -	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
> -	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> -
> -	/* Descriptors and available array need to be set before we expose the
> -	 * new available array entries. */
> -	virtio_wmb(vq->weak_barriers);
> -	vq->avail_idx_shadow++;
> -	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> -	vq->num_added++;
> -
> -	pr_debug("Added buffer head %i to %p\n", head, vq);
> +	vring_set_avail(_vq, head);
>  	END_USE(vq);
>  
> -	/* This is very unlikely, but theoretically possible.  Kick
> -	 * just in case. */
> -	if (unlikely(vq->num_added == (1 << 16) - 1))
> -		virtqueue_kick(_vq);
> -
>  	return 0;
>  
>  unmap_release:
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 988c735..1d89996 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -35,6 +35,12 @@ struct virtqueue {
>  	void *priv;
>  };
>  
> +int virtqueue_add_one_desc(struct virtqueue *_vq,
> +			   uint64_t addr,
> +			   uint32_t len,
> +			   bool in_desc,
> +			   void *data);
> +
>  int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 struct scatterlist sg[], unsigned int num,
>  			 void *data,
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 01/10] idr: add #include <linux/bug.h>
  2017-11-30  0:58   ` Matthew Wilcox
  2017-11-30  7:07     ` Michal Hocko
@ 2017-11-30 21:49     ` Andrew Morton
  1 sibling, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2017-11-30 21:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel,
	Masahiro Yamada

On Wed, 29 Nov 2017 16:58:17 -0800 Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, Nov 29, 2017 at 09:55:17PM +0800, Wei Wang wrote:
> > The <linux/bug.h> was removed from radix-tree.h by the following commit:
> > f5bba9d11a256ad2a1c2f8e7fc6aabe6416b7890.
> > 
> > Since that commit, tools/testing/radix-tree/ couldn't pass compilation
> > due to: tools/testing/radix-tree/idr.c:17: undefined reference to
> > WARN_ON_ONCE. This patch adds the bug.h header to idr.h to solve the
> > issue.
> 
> Thanks; I sent this same patch out yesterday.
> 
> Unfortunately, you didn't cc the author of this breakage, Masahiro Yamada.
> I want to highlight that these kinds of header cleanups are risky,
> and very low reward.  I really don't want to see patches going all over
> the tree randomly touching header files.  If we've got a real problem
> to solve, then sure.  But I want to see a strong justification for any
> more header file cleanups.

I tend to disagree.  We accumulate more and more cruft over time so it
is good to be continually hacking away at it.  These little build
breaks happen occasionally but they are trivially and quickly fixed. 
If a small minority of these cleanups require a followup patch which
consumes a global ten person minutes then that seems an acceptable
price to pay.  Says the guy who pays most of that price :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-11-30 10:34   ` Tetsuo Handa
  2017-11-30 13:35     ` Tetsuo Handa
@ 2017-12-01  8:02     ` Wei Wang
  2017-12-01 13:02       ` Tetsuo Handa
  1 sibling, 1 reply; 37+ messages in thread
From: Wei Wang @ 2017-12-01  8:02 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> + * @start: the start of the bit range, inclusive
>> + * @end: the end of the bit range, inclusive
>> + *
>> + * This function is used to clear a bit in the xbitmap. If all the bits of the
>> + * bitmap are 0, the bitmap will be freed.
>> + */
>> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
>> +{
>> +	struct radix_tree_root *root = &xb->xbrt;
>> +	struct radix_tree_node *node;
>> +	void **slot;
>> +	struct ida_bitmap *bitmap;
>> +	unsigned int nbits;
>> +
>> +	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
>> +		unsigned long index = start / IDA_BITMAP_BITS;
>> +		unsigned long bit = start % IDA_BITMAP_BITS;
>> +
>> +		bitmap = __radix_tree_lookup(root, index, &node, &slot);
>> +		if (radix_tree_exception(bitmap)) {
>> +			unsigned long ebit = bit + 2;
>> +			unsigned long tmp = (unsigned long)bitmap;
>> +
>> +			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> "nbits = min(end - start + 1," seems to expect that start == end is legal
> for clearing only 1 bit. But this function is a no-op if start == end.
> Please clarify what "inclusive" is intended to mean.

If xb_clear_bit_range(xb, 10, 10) is called, it is effectively the same as
xb_clear_bit(10). Why would it be illegal?

"@start inclusive" means that @start itself will also be cleared.

>
>> +static inline __always_inline void bitmap_clear(unsigned long *map,
>> +						unsigned int start,
>> +						unsigned int nbits)
>> +{
>> +	if (__builtin_constant_p(nbits) && nbits == 1)
>> +		__clear_bit(start, map);
>> +	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
>> +		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
> It looks strange to apply __builtin_constant_p test to variables after "& 7".
>

I think this is normal - if the variables are known at compile time, the 
calculation will be done at compile time (termed constant folding).


Best,
Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc
  2017-11-30 19:38   ` Michael S. Tsirkin
@ 2017-12-01  8:06     ` Wei Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-12-01  8:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On 12/01/2017 03:38 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 09:55:22PM +0800, Wei Wang wrote:
>> The current virtqueue_add API implementation is based on the scatterlist
>> struct, which uses kaddr. This does not cover all the use cases of
>> vring. For example:
>> - Some usages don't use an IOMMU; in this case the caller can directly
>>    pass in a physical address it already has, instead of going through
>>    the sg implementation (e.g. the VIRTIO_BALLOON_F_SG feature)
>> - Sometimes, a guest physical page may not have a kaddr (e.g. high
>>    memory) but still needs to use the vring (e.g. the
>>    VIRTIO_BALLOON_F_FREE_PAGE_VQ feature)
>>
>> The new API virtqueue_add_one_desc enables the caller to assign a vring
>> desc with a physical address and len. Also, factor out the common code
>> with virtqueue_add in vring_set_avail.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
> You previously managed without this patch, and it's preferable
> IMHO since this patchset is already too big.
>
> I don't really understand what is wrong with virtio_add_sgs + sg_set_page.
> I don't think is assumes a kaddr.
>

OK, I will use the previous method to send sgs.
Please check whether there are other things that need to be improved.

Best,
Wei
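
For reference, a minimal sketch of the sg-based path Michael refers to
(illustrative only, not the code that was resubmitted; the function name and
its arguments are made up for the example, while sg_init_table(),
sg_set_page() and virtqueue_add_inbuf() are the existing kernel APIs):

	/*
	 * Illustrative sketch: hand one balloon page to the host through the
	 * regular scatterlist-based virtqueue API.  sg_set_page() takes the
	 * struct page directly, so no kernel virtual address is required.
	 */
	static int balloon_send_page_sg(struct virtio_balloon *vb,
					struct virtqueue *vq, struct page *page)
	{
		struct scatterlist sg;
		int err;

		sg_init_table(&sg, 1);
		sg_set_page(&sg, page, PAGE_SIZE, 0);

		/* Same direction as the patch above uses for balloon pages. */
		err = virtqueue_add_inbuf(vq, &sg, 1, vb, GFP_KERNEL);
		if (!err)
			virtqueue_kick(vq);
		return err;
	}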

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-12-01  8:02     ` Wei Wang
@ 2017-12-01 13:02       ` Tetsuo Handa
  2017-12-01 14:13         ` Matthew Wilcox
  2017-12-01 15:09         ` Wang, Wei W
  0 siblings, 2 replies; 37+ messages in thread
From: Tetsuo Handa @ 2017-12-01 13:02 UTC (permalink / raw)
  To: wei.w.wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

Wei Wang wrote:
> On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> > Wei Wang wrote:
> >> + * @start: the start of the bit range, inclusive
> >> + * @end: the end of the bit range, inclusive
> >> + *
> >> + * This function is used to clear a bit in the xbitmap. If all the bits of the
> >> + * bitmap are 0, the bitmap will be freed.
> >> + */
> >> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
> >> +{
> >> +	struct radix_tree_root *root = &xb->xbrt;
> >> +	struct radix_tree_node *node;
> >> +	void **slot;
> >> +	struct ida_bitmap *bitmap;
> >> +	unsigned int nbits;
> >> +
> >> +	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> >> +		unsigned long index = start / IDA_BITMAP_BITS;
> >> +		unsigned long bit = start % IDA_BITMAP_BITS;
> >> +
> >> +		bitmap = __radix_tree_lookup(root, index, &node, &slot);
> >> +		if (radix_tree_exception(bitmap)) {
> >> +			unsigned long ebit = bit + 2;
> >> +			unsigned long tmp = (unsigned long)bitmap;
> >> +
> >> +			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> > "nbits = min(end - start + 1," seems to expect that start == end is legal
> > for clearing only 1 bit. But this function is no-op if start == end.
> > Please clarify what "inclusive" intended.
> 
> If xb_clear_bit_range(xb,10,10), then it is effectively the same as 
> xb_clear_bit(10). Why would it be illegal?
> 
> "@start inclusive" means that the @start will also be included to be 
> cleared.

If start == end is legal,

   for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {

makes this loop do nothing because 10 < 10 is false.



> 
> >
> >> +static inline __always_inline void bitmap_clear(unsigned long *map,
> >> +						unsigned int start,
> >> +						unsigned int nbits)
> >> +{
> >> +	if (__builtin_constant_p(nbits) && nbits == 1)
> >> +		__clear_bit(start, map);
> >> +	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
> >> +		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
> > It looks strange to apply __builtin_constant_p test to variables after "& 7".
> >
> 
> I think this is normal - if the variables are known at compile time, the 
> calculation will be done at compile time (termed constant folding).

I think that

+	else if (__builtin_constant_p(start) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits) && IS_ALIGNED(nbits, 8))

is more readable.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-12-01 13:02       ` Tetsuo Handa
@ 2017-12-01 14:13         ` Matthew Wilcox
  2017-12-01 15:09         ` Wang, Wei W
  1 sibling, 0 replies; 37+ messages in thread
From: Matthew Wilcox @ 2017-12-01 14:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: wei.w.wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

On Fri, Dec 01, 2017 at 10:02:01PM +0900, Tetsuo Handa wrote:
> If start == end is legal,
> 
>    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.

... and this is why we add tests to the test-suite!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH v18 05/10] xbitmap: add more operations
  2017-12-01 13:02       ` Tetsuo Handa
  2017-12-01 14:13         ` Matthew Wilcox
@ 2017-12-01 15:09         ` Wang, Wei W
  2017-12-01 17:25           ` Matthew Wilcox
  1 sibling, 1 reply; 37+ messages in thread
From: Wang, Wei W @ 2017-12-01 15:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> > > Wei Wang wrote:
> > >> + * @start: the start of the bit range, inclusive
> > >> + * @end: the end of the bit range, inclusive
> > >> + *
> > >> + * This function is used to clear a bit in the xbitmap. If all the
> > >> +bits of the
> > >> + * bitmap are 0, the bitmap will be freed.
> > >> + */
> > >> +void xb_clear_bit_range(struct xb *xb, unsigned long start,
> > >> +unsigned long end) {
> > >> +	struct radix_tree_root *root = &xb->xbrt;
> > >> +	struct radix_tree_node *node;
> > >> +	void **slot;
> > >> +	struct ida_bitmap *bitmap;
> > >> +	unsigned int nbits;
> > >> +
> > >> +	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> > >> +		unsigned long index = start / IDA_BITMAP_BITS;
> > >> +		unsigned long bit = start % IDA_BITMAP_BITS;
> > >> +
> > >> +		bitmap = __radix_tree_lookup(root, index, &node, &slot);
> > >> +		if (radix_tree_exception(bitmap)) {
> > >> +			unsigned long ebit = bit + 2;
> > >> +			unsigned long tmp = (unsigned long)bitmap;
> > >> +
> > >> +			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> > > "nbits = min(end - start + 1," seems to expect that start == end is
> > > legal for clearing only 1 bit. But this function is no-op if start == end.
> > > Please clarify what "inclusive" intended.
> >
> > If xb_clear_bit_range(xb,10,10), then it is effectively the same as
> > xb_clear_bit(10). Why would it be illegal?
> >
> > "@start inclusive" means that the @start will also be included to be
> > cleared.
> 
> If start == end is legal,
> 
>    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.


How about "start <= end "?

Best,
Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-29 13:55 ` [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
  2017-11-30 10:35   ` Tetsuo Handa
@ 2017-12-01 15:38   ` Michael S. Tsirkin
  2017-12-04  3:46     ` Wei Wang
  1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2017-12-01 15:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Wed, Nov 29, 2017 at 09:55:23PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
> balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
> host. A scatter-gather list is described by a vring desc.
> 
> The implementation of the previous virtio-balloon is not very efficient,
> because the balloon pages are transferred to the host by one array each
> time. Here is the breakdown of the time in percentage spent on each step
> of the balloon inflating process (inflating 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete. The above
> profiling shows that the bottlenecks are stage 2) and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in sgs. An sg
> describes a chunk of guest physically contiguous pages. With this
> mechanism, step 4) can also be optimized by doing address translation and
> madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~440ms resulting
> in an improvement of ~89%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of
> a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  drivers/virtio/virtio_balloon.c     | 230 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 212 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 7960746..2c21c5a 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,8 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
> +#include <asm/page.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +81,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record balloon pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,15 +146,121 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> +	unsigned int len;
> +
> +	virtqueue_kick(vq);
> +	wait_event(wq_head, virtqueue_get_buf(vq, &len));
> +}
> +
> +static void send_one_desc(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  uint64_t addr,
> +			  uint32_t len,
> +			  bool inbuf,
> +			  bool batch)
> +{
> +	int err;
> +	unsigned int size;
> +
> +	/* Detach all the used buffers from the vq */
> +	while (virtqueue_get_buf(vq, &size))
> +		;
> +
> +	err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq);
> +	/*
> +	 * This is expected to never fail: there is always at least 1 entry
> +	 * available on the vq, because when the vq is full the worker thread
> +	 * that adds the desc will be put into sleep until at least 1 entry is
> +	 * available to use.
> +	 */
> +	BUG_ON(err);
> +
> +	/* If batching is requested, we batch till the vq is full */
> +	if (!batch || !vq->num_free)
> +		kick_and_wait(vq, vb->acked);
> +}
> +

This internal kick complicates callers. I suggest that instead,
you move this to callers, just return a "kick required" boolean.
This way callers do not need to play with num_free at all.

> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long pfn_start, pfn_end;
> +	uint64_t addr;
> +	uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> +	pfn_start = page_xb_start;
> +	while (pfn_start < page_xb_end) {
> +		pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
> +						 page_xb_end);
> +		if (pfn_start == page_xb_end + 1)
> +			break;
> +		pfn_end = xb_find_next_zero_bit(&vb->page_xb,
> +						pfn_start + 1,
> +						page_xb_end);
> +		addr = pfn_start << PAGE_SHIFT;
> +		len = (pfn_end - pfn_start) << PAGE_SHIFT;

This assignment can overflow. The next line compares with UINT_MAX, but by
that time it is too late.  I think you should do all the math in 64 bits to
avoid surprises, then truncate to max_len, and then it's safe to assign
to the sg.

> +		while (len > max_len) {
> +			send_one_desc(vb, vq, addr, max_len, true, true);
> +			addr += max_len;
> +			len -= max_len;
> +		}
> +		send_one_desc(vb, vq, addr, len, true, true);
> +		pfn_start = pfn_end + 1;
> +	}
> +
> +	/*
> +	 * The last few desc-s may not reach the batch size, but need a kick to
> +	 * notify the device to handle them.
> +	 */
> +	if (vq->num_free != virtqueue_get_vring_size(vq))
> +		kick_and_wait(vq, vb->acked);

The actual trigger for the kick is that we did not kick after
the last send_one_desc. If the kick is moved out of send_one_desc,
you can test for that explicitly.

> +
> +	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
> +}
> +
> +static inline int xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	int ret;
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +
> +	do {
> +		ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
> +					     GFP_NOWAIT | __GFP_NOWARN);
> +	} while (unlikely(ret == -EAGAIN));

what exactly does this loop do? Does this wait
forever until there is some free memory? why GFP_NOWAIT?

> +
> +	return ret;
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	unsigned num_allocated_pages;
>  	unsigned num_pfns;
>  	struct page *page;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	for (num_pfns = 0; num_pfns < num;
>  	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
> @@ -172,11 +283,18 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  	vb->num_pfns = 0;
>  
>  	while ((page = balloon_page_pop(&pages))) {
> +		if (use_sg) {
> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> +				__free_page(page);
> +				break;
> +			}
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}
> +
>  		balloon_page_enqueue(&vb->vb_dev_info, page);
>  
>  		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
> -
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -185,8 +303,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -212,9 +334,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -224,7 +349,14 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg) {
> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> +				balloon_page_enqueue(&vb->vb_dev_info, page);
> +				break;
> +			}
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -235,13 +367,55 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
>  }
>  
> +/*
> + * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
> + * for xbitmap, which is not suitable for the oom case. This function does not
> + * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
> + * pages when VIRTIO_BALLOON_F_SG is negotiated.
> + */

I guess we can live with this for now.

Two things to consider
- adding support for pre-allocating indirect buffers
- sorting the internal page queue (how?)


> +static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
> +{
> +	unsigned int n;
> +	struct page *page;
> +	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> +	struct virtqueue *vq = vb->deflate_vq;
> +	LIST_HEAD(pages);
> +
> +	mutex_lock(&vb->balloon_lock);
> +	for (n = 0; n < oom_pages; n++) {
> +		page = balloon_page_dequeue(vb_dev_info);
> +		if (!page)
> +			break;
> +
> +		list_add(&page->lru, &pages);
> +		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		send_one_desc(vb, vq, virt_to_phys(page_address(page)),
> +			      PAGE_SIZE, true, true);
> +		release_pages_balloon(vb, &pages);
> +	}
> +
> +	/*
> +	 * The last few sgs may not reach the batch size, but need a kick to
> +	 * notify the device to handle them.
> +	 */
> +	if (vq->num_free != virtqueue_get_vring_size(vq))
> +		kick_and_wait(vq, vb->acked);
> +	mutex_unlock(&vb->balloon_lock);
> +
> +	return n;
> +}
> +
>  static inline void update_stat(struct virtio_balloon *vb, int idx,
>  			       u16 tag, u64 val)
>  {
> @@ -381,7 +555,10 @@ static int virtballoon_oom_notify(struct notifier_block *self,
>  		return NOTIFY_OK;
>  
>  	freed = parm;
> -	num_freed_pages = leak_balloon(vb, oom_pages);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
> +		num_freed_pages = leak_balloon_sg_oom(vb);
> +	else
> +		num_freed_pages = leak_balloon(vb, oom_pages);
>  	update_balloon_size(vb);
>  	*freed += num_freed_pages;
>  
> @@ -478,6 +655,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -499,16 +677,26 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_one_desc(vb, vb->inflate_vq,
> +			      virt_to_phys(page_address(newpage)),
> +			      PAGE_SIZE, true, false);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_one_desc(vb, vb->inflate_vq,
> +			      virt_to_phys(page_address(page)),
> +			      PAGE_SIZE, true, false);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -567,6 +755,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -683,6 +874,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled
  2017-11-29 13:55 ` [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled Wei Wang
@ 2017-12-01 15:49   ` Michael S. Tsirkin
  2017-12-04  5:39     ` Wei Wang
  2017-12-11  6:38     ` Wei Wang
  0 siblings, 2 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2017-12-01 15:49 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Wed, Nov 29, 2017 at 09:55:26PM +0800, Wei Wang wrote:
> The guest free pages should not be discarded by the live migration thread
> when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
> skipping the transfer of such poisoned free pages will trigger false
> positive when new pages are allocated and checked on the destination.
> This patch skips the reporting of free pages in the above case.
> 
> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> ---
>  drivers/virtio/virtio_balloon.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 035bd3a..6ac4cff 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work)
>  	/* Start by sending the obtained cmd id to the host with an outbuf */
>  	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
>  		      sizeof(uint32_t), false, true, false);
> -	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	if (!(page_poisoning_enabled() &&
> +	    !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
> +		walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>  	/*
>  	 * End by sending the stop id to the host with an outbuf. Use the
>  	 * non-batching mode here to trigger a kick after adding the stop id.

PAGE_POISONING_ZERO is actually OK.

But I really would prefer it that we still send pages to host,
otherwise debugging becomes much harder.

And it does not have to be completely useless, even though
you can not discard them as they would be zero-filled then.

How about a config field telling host what should be there in the free
pages? This way even though host can not discard them, host can send
them out without reading them, still a win.



> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-12-01 15:09         ` Wang, Wei W
@ 2017-12-01 17:25           ` Matthew Wilcox
  2017-12-03  1:50             ` Tetsuo Handa
  0 siblings, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2017-12-01 17:25 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: Tetsuo Handa, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, mhocko, akpm, mawilcox,
	david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Fri, Dec 01, 2017 at 03:09:08PM +0000, Wang, Wei W wrote:
> On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> > If start == end is legal,
> > 
> >    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> > 
> > makes this loop do nothing because 10 < 10 is false.
> 
> How about "start <= end "?

Don't ask Tetsuo for his opinion, write some userspace code that uses it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-11-30 14:39       ` Matthew Wilcox
@ 2017-12-03  1:44         ` Tetsuo Handa
  0 siblings, 0 replies; 37+ messages in thread
From: Tetsuo Handa @ 2017-12-03  1:44 UTC (permalink / raw)
  To: willy
  Cc: wei.w.wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

Matthew Wilcox wrote:
> On Thu, Nov 30, 2017 at 10:35:03PM +0900, Tetsuo Handa wrote:
> > According to xb_set_bit(), it seems to me that we are trying to avoid memory allocation
> > for "struct ida_bitmap" when all set bits within a 1024-bits bitmap reside in the first
> > 61 bits.
> > 
> > But does such saving help? Is there characteristic bias that majority of set bits resides
> > in the first 61 bits, for "bit" is "unsigned long" which holds a page number (isn't it)?
> > If no such bias, wouldn't eliminating radix_tree_exception() case and always storing
> > "struct ida_bitmap" simplifies the code (and make the processing faster)?
> 
> It happens all the time.  The vast majority of users of the IDA set
> low bits.  Also, it's the first 62 bits -- going up to 63 bits with the
> XArray rewrite.

Oops, 0...61 is 62 bits.

What I meant is below (untested) patch. If correct, it can save lines and make
the code easier to read.

 lib/radix-tree.c | 20 +------------
 lib/xbitmap.c    | 88 +++++---------------------------------------------------
 2 files changed, 8 insertions(+), 100 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index a039588..fb84b91 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2152,25 +2152,7 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
  */
 __must_check bool xb_preload(gfp_t gfp)
 {
-	if (!this_cpu_read(ida_bitmap)) {
-		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
-
-		if (!bitmap)
-			return false;
-		/*
-		 * The per-CPU variable is updated with preemption enabled.
-		 * If the calling task is unlucky to be scheduled to another
-		 * CPU which has no ida_bitmap allocation, it will be detected
-		 * when setting a bit (i.e. __xb_set_bit()).
-		 */
-		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
-		kfree(bitmap);
-	}
-
-	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
-		return false;
-
-	return true;
+	return __radix_tree_preload(gfp, XB_PRELOAD_SIZE) == 0;
 }
 EXPORT_SYMBOL(xb_preload);
 
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 816dd3e..426d168 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -18,7 +18,7 @@
  * This function is used to set a bit in the xbitmap. If the bitmap that @bit
  * resides in is not there, the per-cpu ida_bitmap will be taken.
  *
- * Returns: 0 on success. %-EAGAIN indicates that @bit was not set.
+ * Returns: 0 on success. Negative value on failure.
  */
 int xb_set_bit(struct xb *xb, unsigned long bit)
 {
@@ -28,46 +28,19 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
 	struct radix_tree_node *node;
 	void **slot;
 	struct ida_bitmap *bitmap;
-	unsigned long ebit;
 
 	bit %= IDA_BITMAP_BITS;
-	ebit = bit + 2;
 
 	err = __radix_tree_create(root, index, 0, &node, &slot);
 	if (err)
 		return err;
 	bitmap = rcu_dereference_raw(*slot);
-	if (radix_tree_exception(bitmap)) {
-		unsigned long tmp = (unsigned long)bitmap;
-
-		if (ebit < BITS_PER_LONG) {
-			tmp |= 1UL << ebit;
-			rcu_assign_pointer(*slot, (void *)tmp);
-			return 0;
-		}
-		bitmap = this_cpu_xchg(ida_bitmap, NULL);
-		if (!bitmap) {
-			__radix_tree_delete(root, node, slot);
-			return -EAGAIN;
-		}
-		memset(bitmap, 0, sizeof(*bitmap));
-		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
-		rcu_assign_pointer(*slot, bitmap);
-	}
-
 	if (!bitmap) {
-		if (ebit < BITS_PER_LONG) {
-			bitmap = (void *)((1UL << ebit) |
-					RADIX_TREE_EXCEPTIONAL_ENTRY);
-			__radix_tree_replace(root, node, slot, bitmap, NULL);
-			return 0;
-		}
-		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		bitmap = kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN);
 		if (!bitmap) {
 			__radix_tree_delete(root, node, slot);
-			return -EAGAIN;
+			return -ENOMEM;
 		}
-		memset(bitmap, 0, sizeof(*bitmap));
 		__radix_tree_replace(root, node, slot, bitmap, NULL);
 	}
 
@@ -82,7 +55,7 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
  *  @bit: index of the bit to set
  *
  * A wrapper of the xb_preload() and xb_set_bit().
- * Returns: 0 on success; -EAGAIN or -ENOMEM on error.
+ * Returns: 0 on success; -ENOMEM on error.
  */
 int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp)
 {
@@ -113,25 +86,10 @@ void xb_clear_bit(struct xb *xb, unsigned long bit)
 	struct radix_tree_node *node;
 	void **slot;
 	struct ida_bitmap *bitmap;
-	unsigned long ebit;
 
 	bit %= IDA_BITMAP_BITS;
-	ebit = bit + 2;
 
 	bitmap = __radix_tree_lookup(root, index, &node, &slot);
-	if (radix_tree_exception(bitmap)) {
-		unsigned long tmp = (unsigned long)bitmap;
-
-		if (ebit >= BITS_PER_LONG)
-			return;
-		tmp &= ~(1UL << ebit);
-		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
-			__radix_tree_delete(root, node, slot);
-		else
-			rcu_assign_pointer(*slot, (void *)tmp);
-		return;
-	}
-
 	if (!bitmap)
 		return;
 
@@ -164,20 +122,7 @@ void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
 		unsigned long bit = start % IDA_BITMAP_BITS;
 
 		bitmap = __radix_tree_lookup(root, index, &node, &slot);
-		if (radix_tree_exception(bitmap)) {
-			unsigned long ebit = bit + 2;
-			unsigned long tmp = (unsigned long)bitmap;
-
-			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
-
-			if (ebit >= BITS_PER_LONG)
-				continue;
-			bitmap_clear(&tmp, ebit, nbits);
-			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
-				__radix_tree_delete(root, node, slot);
-			else
-				rcu_assign_pointer(*slot, (void *)tmp);
-		} else if (bitmap) {
+		if (bitmap) {
 			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
 
 			if (nbits != IDA_BITMAP_BITS)
@@ -212,12 +157,6 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit)
 
 	if (!bitmap)
 		return false;
-	if (radix_tree_exception(bitmap)) {
-		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
-		if (bit >= BITS_PER_LONG)
-			return false;
-		return (unsigned long)bitmap & (1UL << bit);
-	}
 	return test_bit(bit, bitmap->bitmap);
 }
 EXPORT_SYMBOL(xb_test_bit);
@@ -236,20 +175,7 @@ static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
 		unsigned long bit = start % IDA_BITMAP_BITS;
 
 		bmap = __radix_tree_lookup(root, index, &node, &slot);
-		if (radix_tree_exception(bmap)) {
-			unsigned long tmp = (unsigned long)bmap;
-			unsigned long ebit = bit + 2;
-
-			if (ebit >= BITS_PER_LONG)
-				continue;
-			if (set)
-				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
-			else
-				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
-							 ebit);
-			if (ret < BITS_PER_LONG)
-				return ret - 2 + IDA_BITMAP_BITS * index;
-		} else if (bmap) {
+		if (bmap) {
 			if (set)
 				ret = find_next_bit(bmap->bitmap,
 						    IDA_BITMAP_BITS, bit);
@@ -258,7 +184,7 @@ static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
 							 IDA_BITMAP_BITS, bit);
 			if (ret < IDA_BITMAP_BITS)
 				return ret + index * IDA_BITMAP_BITS;
-		} else if (!bmap && !set) {
+		} else if (!set) {
 			return start;
 		}
 	}
-- 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-12-01 17:25           ` Matthew Wilcox
@ 2017-12-03  1:50             ` Tetsuo Handa
  2017-12-07 12:01               ` Wei Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Tetsuo Handa @ 2017-12-03  1:50 UTC (permalink / raw)
  To: willy, wei.w.wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

Matthew Wilcox wrote:
> On Fri, Dec 01, 2017 at 03:09:08PM +0000, Wang, Wei W wrote:
> > On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> > > If start == end is legal,
> > > 
> > >    for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> > > 
> > > makes this loop do nothing because 10 < 10 is false.
> > 
> > How about "start <= end "?
> 
> Don't ask Tetsuo for his opinion, write some userspace code that uses it.
> 

Please be sure to prepare for "end == -1UL" case, for "start < end" will become
true when "start = (start | (IDA_BITMAP_BITS - 1)) + 1" made "start == 0" due to
overflow.
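
A sketch of one overflow-safe way to write such an inclusive-range loop
(illustrative only; the chunk-clearing body is elided and the structure is
an assumption, not the code that was merged):

	unsigned long cur = start;	/* assumes start <= end on entry */

	for (;;) {
		unsigned long next = (cur | (IDA_BITMAP_BITS - 1)) + 1;

		/* ... clear the bits of this chunk that fall in [cur, end] ... */

		if (next == 0 || next > end)	/* next == 0 means we wrapped */
			break;
		cur = next;
	}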

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-12-01 15:38   ` Michael S. Tsirkin
@ 2017-12-04  3:46     ` Wei Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-12-04  3:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On 12/01/2017 11:38 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 09:55:23PM +0800, Wei Wang wrote:
>> +static void send_one_desc(struct virtio_balloon *vb,
>> +			  struct virtqueue *vq,
>> +			  uint64_t addr,
>> +			  uint32_t len,
>> +			  bool inbuf,
>> +			  bool batch)
>> +{
>> +	int err;
>> +	unsigned int size;
>> +
>> +	/* Detach all the used buffers from the vq */
>> +	while (virtqueue_get_buf(vq, &size))
>> +		;
>> +
>> +	err = virtqueue_add_one_desc(vq, addr, len, inbuf, vq);
>> +	/*
>> +	 * This is expected to never fail: there is always at least 1 entry
>> +	 * available on the vq, because when the vq is full the worker thread
>> +	 * that adds the desc will be put into sleep until at least 1 entry is
>> +	 * available to use.
>> +	 */
>> +	BUG_ON(err);
>> +
>> +	/* If batching is requested, we batch till the vq is full */
>> +	if (!batch || !vq->num_free)
>> +		kick_and_wait(vq, vb->acked);
>> +}
>> +
> This internal kick complicates callers. I suggest that instead,
> you move this to callers, just return a "kick required" boolean.
> This way callers do not need to play with num_free at all.

Then in what situation would the function return true for "kick required"?

I think this wouldn't make a difference fundamentally. For example, suppose
we have 257 sgs (batching size = 256) to send to the host:

while (i < 257) {
     kick_required = send_sgs();
     if (kick_required)
         kick(); /* after the 256th sg has been added, the caller performs a kick() */
}

Do we still need a kick here for the 257th sg, as before? Only the caller
knows whether the last added sgs need a kick (when send_sgs() receives one
sg, it doesn't know whether more are coming).

There is another approach to checking whether the last added sgs have been
synced to the host: expose "vring_virtqueue->num_added" to the caller
via a virtio_ring API:

    unsigned int virtqueue_num_added(struct virtqueue *_vq)
    {
            struct vring_virtqueue *vq = to_vvq(_vq);

            return vq->num_added;
    }
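
With such an accessor, the caller-side batching could look roughly like the
sketch below (illustrative only; have_more_sgs() and send_sgs() and their
return conventions are made up for the example, while kick_and_wait() is the
helper from the patch above):

	while (have_more_sgs()) {
		/* send_sgs() returns true when the vq has become full */
		bool kick_required = send_sgs(vb, vq);

		if (kick_required)
			kick_and_wait(vq, vb->acked);
	}

	/* Flush whatever was added after the last kick. */
	if (virtqueue_num_added(vq))
		kick_and_wait(vq, vb->acked);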



>> +/*
>> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
>> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>> + * The page xbitmap is searched for continuous "1" bits, which correspond
>> + * to continuous pages, to chunk into sgs.
>> + *
>> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
>> + * need to be searched.
>> + */
>> +static void tell_host_sgs(struct virtio_balloon *vb,
>> +			  struct virtqueue *vq,
>> +			  unsigned long page_xb_start,
>> +			  unsigned long page_xb_end)
>> +{
>> +	unsigned long pfn_start, pfn_end;
>> +	uint64_t addr;
>> +	uint32_t len, max_len = round_down(UINT_MAX, PAGE_SIZE);
>> +
>> +	pfn_start = page_xb_start;
>> +	while (pfn_start < page_xb_end) {
>> +		pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
>> +						 page_xb_end);
>> +		if (pfn_start == page_xb_end + 1)
>> +			break;
>> +		pfn_end = xb_find_next_zero_bit(&vb->page_xb,
>> +						pfn_start + 1,
>> +						page_xb_end);
>> +		addr = pfn_start << PAGE_SHIFT;
>> +		len = (pfn_end - pfn_start) << PAGE_SHIFT;
> This assignment can overflow. The next line compares with UINT_MAX, but by
> that time it is too late.  I think you should do all the math in 64 bits to
> avoid surprises, then truncate to max_len, and then it's safe to assign
> to the sg.

Sounds reasonable, thanks.
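
A sketch of what the 64-bit-math version of that part of tell_host_sgs()
could look like (illustrative only, not the change that was actually posted):

	uint64_t nr_bytes = (uint64_t)(pfn_end - pfn_start) << PAGE_SHIFT;
	uint64_t max_len = round_down(UINT_MAX, PAGE_SIZE);

	while (nr_bytes > max_len) {
		send_one_desc(vb, vq, addr, max_len, true, true);
		addr += max_len;
		nr_bytes -= max_len;
	}
	/* nr_bytes now fits in 32 bits, so the truncation is safe. */
	send_one_desc(vb, vq, addr, (uint32_t)nr_bytes, true, true);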


>> +
>> +	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
>> +}
>> +
>> +static inline int xb_set_page(struct virtio_balloon *vb,
>> +			       struct page *page,
>> +			       unsigned long *pfn_min,
>> +			       unsigned long *pfn_max)
>> +{
>> +	unsigned long pfn = page_to_pfn(page);
>> +	int ret;
>> +
>> +	*pfn_min = min(pfn, *pfn_min);
>> +	*pfn_max = max(pfn, *pfn_max);
>> +
>> +	do {
>> +		ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
>> +					     GFP_NOWAIT | __GFP_NOWARN);
>> +	} while (unlikely(ret == -EAGAIN));
> what exactly does this loop do? Does this wait
> forever until there is some free memory? why GFP_NOWAIT?

Basically, "-EAGAIN" is returned from xb_set_bit() in the case when the 
pre-allocated per-cpu ida_bitmap is NULL. In that case, the caller 
re-invokes xb_preload_and_set_bit(), which re-invokes xb_preload to 
allocate ida_bitmap. So "-EAGAIN" actually does not indicate a status 
about memory allocation. "-ENOMEM" is the one to indicate the failure of 
memory allocation, but the loop doesn't re-try on "-ENOMEM".

GFP_NOWAIT is used to avoid memory reclaiming, which could cause the 
deadlock issue we discussed before.




>   	return num_freed_pages;
>   }
>   
> +/*
> + * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
> + * for xbitmap, which is not suitable for the oom case. This function does not
> + * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
> + * pages when VIRTIO_BALLOON_F_SG is negotiated.
> + */
> I guess we can live with this for now.

Agreed, the patchset is already big. We can get the basic implementation in
first and leave the following as future work. I can note that in the
commit log.

> Two things to consider
> - adding support for pre-allocating indirect buffers
> - sorting the internal page queue (how?)


Best,
Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled
  2017-12-01 15:49   ` Michael S. Tsirkin
@ 2017-12-04  5:39     ` Wei Wang
  2017-12-11  6:38     ` Wei Wang
  1 sibling, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-12-04  5:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On 12/01/2017 11:49 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 09:55:26PM +0800, Wei Wang wrote:
>> The guest free pages should not be discarded by the live migration thread
>> when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
>> skipping the transfer of such poisoned free pages will trigger false
>> positive when new pages are allocated and checked on the destination.
>> This patch skips the reporting of free pages in the above case.
>>
>> Reported-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> ---
>>   drivers/virtio/virtio_balloon.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 035bd3a..6ac4cff 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work)
>>   	/* Start by sending the obtained cmd id to the host with an outbuf */
>>   	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
>>   		      sizeof(uint32_t), false, true, false);
>> -	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>> +	if (!(page_poisoning_enabled() &&
>> +	    !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
>> +		walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>>   	/*
>>   	 * End by sending the stop id to the host with an outbuf. Use the
>>   	 * non-batching mode here to trigger a kick after adding the stop id.
> PAGE_POISONING_ZERO is actually OK.

I think the 0-filled pages still need to be sent. If the host on the 
destination doesn't use PAGE_POISONING_ZERO, then the pages it offers to 
the guest on the destination may have non-0 values.



>
> But I really would prefer it that we still send pages to host,
> otherwise debugging becomes much harder.
>
> And it does not have to be completely useless, even though
> you can not discard them as they would be zero-filled then.
>
> How about a config field telling host what should be there in the free
> pages? This way even though host can not discard them, host can send
> them out without reading them, still a win.
>
>

OK, but I think we would need two 32-bit config registers:

__u32 page_poison_val;  /* stores the PAGE_POISON value, but it can't
                           indicate whether page poisoning is in use */

__u32 special_features; /* set bit 0 to indicate page poisoning is in use */

#define VIRTIO_BALLOON_SF_PAGE_POISON (1 << 0)


Best,
Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-12-03  1:50             ` Tetsuo Handa
@ 2017-12-07 12:01               ` Wei Wang
  2017-12-07 15:41                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 37+ messages in thread
From: Wei Wang @ 2017-12-07 12:01 UTC (permalink / raw)
  To: Tetsuo Handa, willy
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	yang.zhang.wz, quan.xu, nilal, riel

On 12/03/2017 09:50 AM, Tetsuo Handa wrote:
> Matthew Wilcox wrote:
>> On Fri, Dec 01, 2017 at 03:09:08PM +0000, Wang, Wei W wrote:
>>> On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
>>>> If start == end is legal,
>>>>
>>>>     for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
>>>>
>>>> makes this loop do nothing because 10 < 10 is false.
>>> How about "start <= end "?
>> Don't ask Tetsuo for his opinion, write some userspace code that uses it.
>>
> Please be sure to prepare for "end == -1UL" case, for "start < end" will become
> true when "start = (start | (IDA_BITMAP_BITS - 1)) + 1" made "start == 0" due to
> overflow.

I think there is one more corner case with this API: when searching for bit
"1" in [0, ULONG_MAX] while no bit is set in the range, there appears to be
no possible value that we can return (returning "end + 1" would be
"ULONG_MAX + 1", which is 0).
I plan to make "end" exclusive in the search, that is, [start, end), and
return "end" if no such bit is found.

For cases like [16, 16), returning 16 doesn't mean bit 16 is 1 or 0; it
simply means there are no bits to search in the given range, since 16 is
exclusive.

Please let me know if you have a different thought.

Best,
Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 05/10] xbitmap: add more operations
  2017-12-07 12:01               ` Wei Wang
@ 2017-12-07 15:41                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 37+ messages in thread
From: Michael S. Tsirkin @ 2017-12-07 15:41 UTC (permalink / raw)
  To: Wei Wang
  Cc: Tetsuo Handa, willy, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mhocko, akpm, mawilcox, david,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Thu, Dec 07, 2017 at 08:01:24PM +0800, Wei Wang wrote:
> On 12/03/2017 09:50 AM, Tetsuo Handa wrote:
> > Matthew Wilcox wrote:
> > > On Fri, Dec 01, 2017 at 03:09:08PM +0000, Wang, Wei W wrote:
> > > > On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> > > > > If start == end is legal,
> > > > > 
> > > > >     for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> > > > > 
> > > > > makes this loop do nothing because 10 < 10 is false.
> > > > How about "start <= end "?
> > > Don't ask Tetsuo for his opinion, write some userspace code that uses it.
> > > 
> > Please be sure to prepare for "end == -1UL" case, for "start < end" will become
> > true when "start = (start | (IDA_BITMAP_BITS - 1)) + 1" made "start == 0" due to
> > overflow.
> 
> I think there is one more corner case with this API: searching for bit "1"
> from [0, ULONG_MAX] while no bit is set in the range, there appear to be no
> possible value that we can return (returning "end + 1" will be "ULONG_MAX +
> 1", which is 0)
> I plan to make the "end" be exclusive of the searching, that is, [start,
> end), and return "end" if no such bit is found.
> 
> For cases like [16, 16), returning 16 doesn't mean bit 16 is 1 or 0, it
> simply means there is no bits to search in the given range, since 16 is
> exclusive.
> 
> Please let me know if you have a different thought.
> 
> Best,
> Wei

Matthew is right though - you want to include tests for all
these corner cases.
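
A sketch of what such corner-case tests might assert, in the spirit of the
tools/testing/radix-tree tests (this assumes the proposed [start, end)
semantics where the find functions return @end when nothing is found; the
test function name is made up for the example):

	static void xbitmap_check_corner_cases(struct xb *xb)
	{
		/* Empty range: nothing to search, so "not found" is returned. */
		assert(xb_find_next_set_bit(xb, 16, 16) == 16);

		/* Single-bit range. */
		assert(!xb_preload_and_set_bit(xb, 10, GFP_KERNEL));
		assert(xb_find_next_set_bit(xb, 10, 11) == 10);
		xb_clear_bit_range(xb, 10, 11);
		assert(!xb_test_bit(xb, 10));

		/* A range ending at ULONG_MAX must not wrap around to 0. */
		assert(!xb_preload_and_set_bit(xb, ULONG_MAX - 1, GFP_KERNEL));
		xb_clear_bit_range(xb, 0, ULONG_MAX);
		assert(!xb_test_bit(xb, ULONG_MAX - 1));
	}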

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled
  2017-12-01 15:49   ` Michael S. Tsirkin
  2017-12-04  5:39     ` Wei Wang
@ 2017-12-11  6:38     ` Wei Wang
  2017-12-11 13:24       ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Wei Wang @ 2017-12-11  6:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On 12/01/2017 11:49 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 09:55:26PM +0800, Wei Wang wrote:
>> The guest free pages should not be discarded by the live migration thread
>> when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
>> skipping the transfer of such poisoned free pages will trigger false
>> positive when new pages are allocated and checked on the destination.
>> This patch skips the reporting of free pages in the above case.
>>
>> Reported-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> ---
>>   drivers/virtio/virtio_balloon.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 035bd3a..6ac4cff 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work)
>>   	/* Start by sending the obtained cmd id to the host with an outbuf */
>>   	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
>>   		      sizeof(uint32_t), false, true, false);
>> -	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>> +	if (!(page_poisoning_enabled() &&
>> +	    !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
>> +		walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>>   	/*
>>   	 * End by sending the stop id to the host with an outbuf. Use the
>>   	 * non-batching mode here to trigger a kick after adding the stop id.
> PAGE_POISONING_ZERO is actually OK.
>
> But I really would prefer it that we still send pages to host,
> otherwise debugging becomes much harder.
>
> And it does not have to be completely useless, even though
> you can not discard them as they would be zero-filled then.
>
> How about a config field telling host what should be there in the free
> pages? This way even though host can not discard them, host can send
> them out without reading them, still a win.
>
>

Since this poison value comes with the free page reporting feature, how 
about sending the poison value via the free_page_vq, along with the cmd 
id in the outbuf? That is, use the following interface:

struct virtio_balloon_free_page_vq_hdr {
     bool page_poisoning;
     __virtio32 poison_value;
     __virtio32 cmd_id;
}

We need "bool page_poisoning" because "poison_value=0" doesn't tell 
whether page poising is in use by the guest. PAGE_POISONING_ZERO sets 
"page_poisoning=true, poisoning_value=0", and the host will send the 
0-filled pages to the destination (if not sending 0-filled pages, the 
destination host would offer non-zero pages to the guest)
The host can discard free pages only when "page_poisoning=false".

Best,
Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled
  2017-12-11  6:38     ` Wei Wang
@ 2017-12-11 13:24       ` Michael S. Tsirkin
  2017-12-12 12:21         ` Wei Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2017-12-11 13:24 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On Mon, Dec 11, 2017 at 02:38:45PM +0800, Wei Wang wrote:
> On 12/01/2017 11:49 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 29, 2017 at 09:55:26PM +0800, Wei Wang wrote:
> > > The guest free pages should not be discarded by the live migration thread
> > > when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
> > > skipping the transfer of such poisoned free pages will trigger false
> > > positive when new pages are allocated and checked on the destination.
> > > This patch skips the reporting of free pages in the above case.
> > > 
> > > Reported-by: Michael S. Tsirkin <mst@redhat.com>
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c | 4 +++-
> > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 035bd3a..6ac4cff 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work)
> > >   	/* Start by sending the obtained cmd id to the host with an outbuf */
> > >   	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
> > >   		      sizeof(uint32_t), false, true, false);
> > > -	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > > +	if (!(page_poisoning_enabled() &&
> > > +	    !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
> > > +		walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > >   	/*
> > >   	 * End by sending the stop id to the host with an outbuf. Use the
> > >   	 * non-batching mode here to trigger a kick after adding the stop id.
> > PAGE_POISONING_ZERO is actually OK.
> > 
> > But I really would prefer it that we still send pages to host,
> > otherwise debugging becomes much harder.
> > 
> > And it does not have to be completely useless, even though
> > you can not discard them as they would be zero-filled then.
> > 
> > How about a config field telling host what should be there in the free
> > pages? This way even though host can not discard them, host can send
> > them out without reading them, still a win.
> > 
> > 
> 
> Since this poison value comes with the free page reporting feature, how
> about sending the poison value via the free_page_vq, along with the cmd id
> in the outbuf? That is, use the following interface:
> 
> struct virtio_balloon_free_page_vq_hdr {
>     bool page_poisoning;
>     __virtio32 poison_value;
>     __virtio32 cmd_id;
> }

Can we put the value in config space instead?

> We need "bool page_poisoning" because "poison_value=0" doesn't tell whether
> page poisoning is in use by the guest.

Can we use a feature bit for this?

> PAGE_POISONING_ZERO sets
> "page_poisoning=true, poisoning_value=0", and the host will send the
> 0-filled pages to the destination (if not sending 0-filled pages, the
> destination host would offer non-zero pages to the guest)

Why would it? Linux zeroes all pages on alloc.

> The host can discard free pages only when "page_poisoning=false".
> 
> Best,
> Wei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled
  2017-12-11 13:24       ` Michael S. Tsirkin
@ 2017-12-12 12:21         ` Wei Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Wei Wang @ 2017-12-12 12:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, nilal, riel

On 12/11/2017 09:24 PM, Michael S. Tsirkin wrote:
> On Mon, Dec 11, 2017 at 02:38:45PM +0800, Wei Wang wrote:
>> On 12/01/2017 11:49 PM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 29, 2017 at 09:55:26PM +0800, Wei Wang wrote:
>>>> The guest free pages should not be discarded by the live migration thread
>>>> when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
>>>> skipping the transfer of such poisoned free pages will trigger false
>>>> positives when new pages are allocated and checked on the destination.
>>>> This patch skips the reporting of free pages in the above case.
>>>>
>>>> Reported-by: Michael S. Tsirkin <mst@redhat.com>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c | 4 +++-
>>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 035bd3a..6ac4cff 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -652,7 +652,9 @@ static void report_free_page(struct work_struct *work)
>>>>    	/* Start by sending the obtained cmd id to the host with an outbuf */
>>>>    	send_one_desc(vb, vb->free_page_vq, virt_to_phys(&vb->start_cmd_id),
>>>>    		      sizeof(uint32_t), false, true, false);
>>>> -	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>>>> +	if (!(page_poisoning_enabled() &&
>>>> +	    !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
>>>> +		walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>>>>    	/*
>>>>    	 * End by sending the stop id to the host with an outbuf. Use the
>>>>    	 * non-batching mode here to trigger a kick after adding the stop id.
>>> PAGE_POISONING_ZERO is actually OK.
>>>
>>> But I really would prefer that we still send pages to the host,
>>> otherwise debugging becomes much harder.
>>>
>>> And it does not have to be completely useless, even though
>>> you can not discard them as they would be zero-filled then.
>>>
>>> How about a config field telling the host what should be there in the free
>>> pages? This way, even though the host cannot discard them, it can send
>>> them out without reading them, which is still a win.
>>>
>>>
>> Since this poison value comes with the free page reporting feature, how
>> about sending the poison value via the free_page_vq, along with the cmd id
>> in the outbuf? That is, use the following interface:
>>
>> struct virtio_balloon_free_page_vq_hdr {
>>      bool page_poisoning;
>>      __virtio32 poison_value;
>>      __virtio32 cmd_id;
>> }
> Can we put the value in config space instead?
>
>> We need "bool page_poisoning" because "poison_value=0" doesn't tell whether
>> page poisoning is in use by the guest.
> Can we use a feature bit for this?
>
>> PAGE_POISONING_ZERO sets
>> "page_poisoning=true, poisoning_value=0", and the host will send the
>> 0-filled pages to the destination (if not sending 0-filled pages, the
>> destination host would offer non-zero pages to the guest)
> Why would it? Linux zeroes all pages on alloc.
>

Thanks, that is the case. I think we don't need a feature bit then.
Please check the v19 patches.

Best,
Wei
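
For comparison, a minimal sketch of the config-space alternative discussed in
this exchange: the guest publishes its poison pattern through the device
configuration space rather than in-band on the free page vq. The poison_val
member of struct virtio_balloon_config and the function name are assumed here
for illustration only; they are not part of the v18 interface.

/* Sketch only, assuming the context of drivers/virtio/virtio_balloon.c
 * and a hypothetical poison_val field in struct virtio_balloon_config.
 */
static void virtballoon_advertise_poison(struct virtio_balloon *vb)
{
	u32 poison_val = 0;

	/* Pages are poisoned by filling them with the PAGE_POISON byte,
	 * so repeat that byte across the 32-bit config value.
	 */
	if (page_poisoning_enabled())
		memset(&poison_val, PAGE_POISON, sizeof(poison_val));

	virtio_cwrite(vb->vdev, struct virtio_balloon_config,
		      poison_val, &poison_val);
}

This would typically run once at probe time, after feature negotiation; a
value of zero would then tell the host that free pages are either unpoisoned
or zero-poisoned, which is what allows it to keep discarding them safely.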


Thread overview: 37+ messages
2017-11-29 13:55 [PATCH v18 00/10] Virtio-balloon Enhancement Wei Wang
2017-11-29 13:55 ` [PATCH v18 01/10] idr: add #include <linux/bug.h> Wei Wang
2017-11-30  0:58   ` Matthew Wilcox
2017-11-30  7:07     ` Michal Hocko
2017-11-30 21:49     ` Andrew Morton
2017-11-29 13:55 ` [PATCH v18 02/10] radix tree test suite: remove ARRAY_SIZE to avoid redefinition Wei Wang
2017-11-29 13:55 ` [PATCH v18 03/10] xbitmap: Introduce xbitmap Wei Wang
2017-11-29 13:55 ` [PATCH v18 04/10] xbitmap: potential improvement Wei Wang
2017-11-29 13:55 ` [PATCH v18 05/10] xbitmap: add more operations Wei Wang
2017-11-30 10:34   ` Tetsuo Handa
2017-11-30 13:35     ` Tetsuo Handa
2017-11-30 14:39       ` Matthew Wilcox
2017-12-03  1:44         ` Tetsuo Handa
2017-12-01  8:02     ` Wei Wang
2017-12-01 13:02       ` Tetsuo Handa
2017-12-01 14:13         ` Matthew Wilcox
2017-12-01 15:09         ` Wang, Wei W
2017-12-01 17:25           ` Matthew Wilcox
2017-12-03  1:50             ` Tetsuo Handa
2017-12-07 12:01               ` Wei Wang
2017-12-07 15:41                 ` Michael S. Tsirkin
2017-11-29 13:55 ` [PATCH v18 06/10] virtio_ring: add a new API, virtqueue_add_one_desc Wei Wang
2017-11-30 19:38   ` Michael S. Tsirkin
2017-12-01  8:06     ` Wei Wang
2017-11-29 13:55 ` [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
2017-11-30 10:35   ` Tetsuo Handa
2017-11-30 16:25     ` Wang, Wei W
2017-12-01 15:38   ` Michael S. Tsirkin
2017-12-04  3:46     ` Wei Wang
2017-11-29 13:55 ` [PATCH v18 08/10] mm: support reporting free page blocks Wei Wang
2017-11-29 13:55 ` [PATCH v18 09/10] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
2017-11-29 13:55 ` [PATCH v18 10/10] virtio-balloon: don't report free pages when page poisoning is enabled Wei Wang
2017-12-01 15:49   ` Michael S. Tsirkin
2017-12-04  5:39     ` Wei Wang
2017-12-11  6:38     ` Wei Wang
2017-12-11 13:24       ` Michael S. Tsirkin
2017-12-12 12:21         ` Wei Wang
