linux-mm.kvack.org archive mirror
* [PATCH v17 0/6] Virtio-balloon Enhancement
@ 2017-11-03  8:13 Wei Wang
  2017-11-03  8:13 ` [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap Wei Wang
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-03  8:13 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using scatter-gather lists (sgs), instead of one array each time; and
2) free page block reporting: a new virtqueue to report guest free pages
to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration transfers the VM's memory from the source machine to the
destination round by round. In the 1st round, all of the VM's memory is
transferred. From the 2nd round onward, only the pieces of memory that the
guest wrote after the previous round are transferred. One method commonly
used by hypervisors to track which parts of memory have been written is to
write-protect all the guest memory.

The second feature enables an optimization of the 1st-round memory
transfer: the hypervisor can skip sending the guest's free pages in that
round. It does not matter if a page is reused after the guest has reported
it as free, because the hypervisor's dirty tracking will catch the write
and the page will be transferred in a later round.
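
The skip-then-catch-up logic above can be sketched in user space. This is
an illustrative model only, not the hypervisor's actual code; all names
(round1_pages, roundn_pages) are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Round 1: send every page NOT hinted as free by the guest. */
static size_t round1_pages(const bool *free_hint, bool *send, size_t n)
{
	size_t sent = 0;

	for (size_t i = 0; i < n; i++) {
		send[i] = !free_hint[i];
		sent += send[i];
	}
	return sent;
}

/*
 * Later rounds: send only pages dirtied since the last round.  A page
 * that was skipped as "free" but then written shows up here, so skipping
 * it in round 1 loses nothing.
 */
static size_t roundn_pages(const bool *dirty, bool *send, size_t n)
{
	size_t sent = 0;

	for (size_t i = 0; i < n; i++) {
		send[i] = dirty[i];
		sent += send[i];
	}
	return sent;
}
```

The point of the model is that a stale hint is harmless: a hinted-free page
that is reused becomes dirty and is picked up by normal dirty tracking.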

ChangeLog:
v16->v17:
1) patch 1: please check the commit log there;
2) patch 3: included Michael S. Tsirkin's patch to fix a potential
deadlock issue;
3) patch 4: use BUG_ON if virtqueue_add_ returns an error, which is never
expected to happen;
4) patch 4: add leak_balloon_sg_oom, which is used in the oom case when
VIRTIO_BALLOON_F_SG is in use;
5) patch 6: use config registers, instead of a vq, as the command channel
between the host and guest;
6) patch 6: add the command sequence id support.

v15->v16:
1) mm: stop reporting the free pfn range if the callback returns false;
2) mm: move some implementaion of walk_free_mem_block into a function to
make the code layout looks better;
3) xbitmap: added some optimizations suggested by Matthew; please refer to
the ChangeLog in the xbitmap patch for details;
4) xbitmap: added a test suite;
5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf returns
an error;
6) virtio-balloon: some small code re-arrangement, e.g. detaching the used
buf from the vq before adding a new buf.

v14->v15:
1) mm: make the report callback return a bool value - returning true to
stop walking through the free page list;
2) virtio-balloon: batch sgs of balloon pages until the vq is full;
3) virtio-balloon: create a new workqueue, rather than using the default
system_wq, to queue the free page reporting work item.
4) virtio-balloon: add a ctrl_vq to be a central control plane which will
handle all the future control related commands between the host and guest.
Add free page report as the first feature controlled under ctrl_vq, and
the free_page_vq is a data plane vq dedicated to the transmission of free
page blocks.

v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page block to the device, if the
vq is full (less likely to happen in practice), just skip reporting this
block instead of busy-waiting until an entry gets released;
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the free page blocks from the
report function. This avoids exposing zone internals to a kernel
module;
2) virtio-balloon: send balloon pages or a free page block using a single
sg each time. This has the benefits of simpler implementation with no new
APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input
sgs, and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring
desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE()
macro to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages
blocks directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3) virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (2):
  lib/xbitmap: Introduce xbitmap
  radix tree test suite: add tests for xbitmap

Michael S. Tsirkin (1):
  mm/balloon_compaction.c: split balloon page allocation and enqueue

Wei Wang (3):
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

 drivers/virtio/virtio_balloon.c         | 489 +++++++++++++++++++++++++++++---
 include/linux/balloon_compaction.h      |  34 ++-
 include/linux/mm.h                      |   6 +
 include/linux/radix-tree.h              |   2 +
 include/linux/xbitmap.h                 |  67 +++++
 include/uapi/linux/virtio_balloon.h     |  12 +
 lib/Makefile                            |   2 +-
 lib/radix-tree.c                        |  51 +++-
 lib/xbitmap.c                           | 283 ++++++++++++++++++
 mm/balloon_compaction.c                 |  28 +-
 mm/page_alloc.c                         |  91 ++++++
 tools/include/linux/bitmap.h            |  34 +++
 tools/include/linux/kernel.h            |   2 +
 tools/testing/radix-tree/Makefile       |   7 +-
 tools/testing/radix-tree/linux/kernel.h |   2 -
 tools/testing/radix-tree/main.c         |   5 +
 tools/testing/radix-tree/test.h         |   1 +
 tools/testing/radix-tree/xbitmap.c      | 278 ++++++++++++++++++
 18 files changed, 1335 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c
 create mode 100644 tools/testing/radix-tree/xbitmap.c

-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap
  2017-11-03  8:13 [PATCH v17 0/6] Virtio-balloon Enhancement Wei Wang
@ 2017-11-03  8:13 ` Wei Wang
  2017-11-03 10:55   ` Tetsuo Handa
  2017-11-03  8:13 ` [PATCH v17 2/6] radix tree test suite: add tests for xbitmap Wei Wang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-03  8:13 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit(), xb_clear_bit_range(), xb_test_bit(),
xb_find_next_set_bit(), xb_find_next_zero_bit().

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits.
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add Tags to help searching.
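
The intended shape of the API above - high index bits select an on-demand
block, low bits select a bit within it - can be illustrated with a small
user-space model. The real implementation stores IDA bitmaps in a radix
tree (and packs small bitmaps into exceptional entries); the flat table
and all names here (sparse_bitmap, sb_set_bit, sb_test_bit) are
hypothetical simplifications:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define BLOCK_BITS 1024		/* bits per on-demand block */
#define MAX_BLOCKS 64		/* flat table instead of a radix tree */

struct sparse_bitmap {
	unsigned long *blocks[MAX_BLOCKS];  /* NULL until a bit is set there */
};

static int sb_set_bit(struct sparse_bitmap *sb, unsigned long bit)
{
	unsigned long idx = bit / BLOCK_BITS, off = bit % BLOCK_BITS;

	if (idx >= MAX_BLOCKS)
		return -1;
	if (!sb->blocks[idx]) {
		/* allocate the block lazily, zero-filled */
		sb->blocks[idx] = calloc(BLOCK_BITS / 8, 1);
		if (!sb->blocks[idx])
			return -1;
	}
	sb->blocks[idx][off / (8 * sizeof(unsigned long))] |=
		1UL << (off % (8 * sizeof(unsigned long)));
	return 0;
}

static bool sb_test_bit(struct sparse_bitmap *sb, unsigned long bit)
{
	unsigned long idx = bit / BLOCK_BITS, off = bit % BLOCK_BITS;

	if (idx >= MAX_BLOCKS || !sb->blocks[idx])
		return false;	/* unallocated blocks read as all zeroes */
	return sb->blocks[idx][off / (8 * sizeof(unsigned long))] &
	       (1UL << (off % (8 * sizeof(unsigned long))));
}
```

As in the kernel code, memory is only consumed for regions of the index
space that actually contain set bits, which is what makes the structure
efficient for set bits that tend to cluster.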

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

v16->v17 ChangeLog:
1) xb_preload: allocate the ida bitmap before __radix_tree_preload() to
avoid kmalloc with preemption disabled. Also change this function to
return with preemption enabled on error;
2) xb_preload_and_set_bit: a wrapper of xb_preload() and xb_set_bit(), for
convenience of use.

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) Clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly, rather than
      searching the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first.
3) more efficient bit searching, like 2.A.
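
The range-clear optimization in 2.B can be sketched with a simplified
user-space model (one word per block for brevity; struct model and all
function names are hypothetical): when the cleared range covers a whole
block, the block is freed outright instead of being zeroed first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define BLK_BITS 64		/* one unsigned long per block for brevity */
#define NBLKS    16

struct model { unsigned long *blk[NBLKS]; };

static void set_bit_m(struct model *m, unsigned long i)
{
	unsigned long idx = i / BLK_BITS;

	if (!m->blk[idx])
		m->blk[idx] = calloc(1, sizeof(unsigned long));
	*m->blk[idx] |= 1UL << (i % BLK_BITS);
}

static bool test_bit_m(struct model *m, unsigned long i)
{
	unsigned long idx = i / BLK_BITS;

	return m->blk[idx] && ((*m->blk[idx] >> (i % BLK_BITS)) & 1);
}

static void clear_range(struct model *m, unsigned long start, unsigned long end)
{
	while (start <= end) {
		unsigned long idx = start / BLK_BITS, bit = start % BLK_BITS;
		unsigned long last = idx * BLK_BITS + BLK_BITS - 1;
		unsigned long stop = end < last ? end : last;

		if (m->blk[idx]) {
			if (bit == 0 && stop == last) {
				/* range covers the whole block: free it,
				 * no need to zero the bits first */
				free(m->blk[idx]);
				m->blk[idx] = NULL;
			} else {
				for (unsigned long b = bit; b <= stop % BLK_BITS; b++)
					*m->blk[idx] &= ~(1UL << b);
				if (*m->blk[idx] == 0) {
					/* block became empty: free it */
					free(m->blk[idx]);
					m->blk[idx] = NULL;
				}
			}
		}
		if (stop == (unsigned long)-1)
			break;	/* avoid overflow when end == ULONG_MAX */
		start = stop + 1;
	}
}
```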
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  67 +++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  51 +++++++-
 lib/xbitmap.c              | 283 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 402 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 567ebb5..1d6d6f6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..00b59c3
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,67 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+bool xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index dafa796..082361b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8b1feca..269a5cc 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index are used to find the
+ * corresponding item (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,38 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }
 
 /**
+ * xb_preload - preload for xb_set_bit()
+ * @gfp: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). On success,
+ * return true, with preemption disabled. On error, return false with
+ * preemption enabled.
+ */
+bool xb_preload(gfp_t gfp)
+{
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return false;
+		/*
+		 * The per-CPU variable is updated with preemption enabled.
+		 * If the calling task is unlucky enough to be scheduled to
+		 * another CPU which has no ida_bitmap allocated, that will
+		 * be detected when setting a bit (i.e. in xb_set_bit()).
+		 */
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		return false;
+
+	return true;
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..3e891d6
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,283 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ * xb_set_bit - set a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * Returns: 0 on success; -EAGAIN on error, in which case the caller is
+ * expected to restart from xb_preload().
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_preload_and_set_bit - preload memory and set a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ * @gfp: allocation mask to use for preloading
+ *
+ * Wraps xb_preload() and xb_set_bit(). Returns: 0 on success; -EAGAIN or -ENOMEM on error.
+ */
+int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp)
+{
+	int ret = 0;
+
+	if (!xb_preload(gfp))
+		return -ENOMEM;
+
+	ret = xb_set_bit(xb, bit);
+	xb_preload_end();
+
+	return ret;
+}
+EXPORT_SYMBOL(xb_preload_and_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @xb: the xbitmap tree used to record the bits
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * Clear a range of bits; any ida bitmap that becomes empty is freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * Test a bit in the xbitmap.
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit >= BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
-- 
2.7.4



* [PATCH v17 2/6] radix tree test suite: add tests for xbitmap
  2017-11-03  8:13 [PATCH v17 0/6] Virtio-balloon Enhancement Wei Wang
  2017-11-03  8:13 ` [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap Wei Wang
@ 2017-11-03  8:13 ` Wei Wang
  2017-11-06 17:00   ` Matthew Wilcox
  2017-11-03  8:13 ` [PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue Wei Wang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-03  8:13 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

Add the following tests for xbitmap:
1) single bit test: single bit set/clear/find;
2) bit range test: set/clear a range of bits and find a 0 or 1 bit in
the range.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 tools/include/linux/bitmap.h            |  34 ++++
 tools/include/linux/kernel.h            |   2 +
 tools/testing/radix-tree/Makefile       |   7 +-
 tools/testing/radix-tree/linux/kernel.h |   2 -
 tools/testing/radix-tree/main.c         |   5 +
 tools/testing/radix-tree/test.h         |   1 +
 tools/testing/radix-tree/xbitmap.c      | 278 ++++++++++++++++++++++++++++++++
 7 files changed, 326 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/radix-tree/xbitmap.c

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index e8b9f51..890dab2 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -36,6 +36,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
 	}
 }
 
+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+				  int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
+
+static __always_inline void bitmap_clear(unsigned long *map,
+					 unsigned int start,
+					 unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 {
 	unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 77d2e94..21e90ee 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -12,6 +12,8 @@
 #define UINT_MAX	(~0U)
 #endif
 
+#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 
 #define PERF_ALIGN(x, a)	__PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 6a9480c..fc7cb422 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -5,7 +5,8 @@ LDLIBS+= -lpthread -lurcu
 TARGETS = main idr-test multiorder
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
-	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+	 xbitmap.o
 
 ifndef SHIFT
 	SHIFT=3
@@ -24,6 +25,9 @@ idr-test: idr-test.o $(CORE_OFILES)
 
 multiorder: multiorder.o $(CORE_OFILES)
 
+xbitmap: xbitmap.o $(CORE_OFILES)
+	$(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
+
 clean:
 	$(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h
 
@@ -33,6 +37,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
 	../../include/linux/*.h \
 	../../include/asm/*.h \
 	../../../include/linux/radix-tree.h \
+	../../../include/linux/xbitmap.h \
 	../../../include/linux/idr.h
 
 radix-tree.c: ../../../lib/radix-tree.c
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index b21a77f..c1e6088 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -16,6 +16,4 @@
 #define pr_debug printk
 #define pr_cont printk
 
-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
-
 #endif /* _KERNEL_H */
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index bc9a784..6f4774e 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -337,6 +337,11 @@ static void single_thread_tests(bool long_run)
 	rcu_barrier();
 	printv(2, "after copy_tag_check: %d allocated, preempt %d\n",
 		nr_allocated, preempt_count);
+
+	xbitmap_checks();
+	rcu_barrier();
+	printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+			nr_allocated, preempt_count);
 }
 
 int main(int argc, char **argv)
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index 0f8220c..f8dcdaa 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -36,6 +36,7 @@ void iteration_test(unsigned order, unsigned duration);
 void benchmark(void);
 void idr_checks(void);
 void ida_checks(void);
+void xbitmap_checks(void);
 void ida_thread_tests(void);
 
 struct item *
diff --git a/tools/testing/radix-tree/xbitmap.c b/tools/testing/radix-tree/xbitmap.c
new file mode 100644
index 0000000..bee8a38
--- /dev/null
+++ b/tools/testing/radix-tree/xbitmap.c
@@ -0,0 +1,278 @@
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include "../../../include/linux/xbitmap.h"
+
+static DEFINE_XB(xb1);
+
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+
+int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp)
+{
+	int ret = 0;
+
+	if (!xb_preload(gfp))
+		return -ENOMEM;
+
+	ret = xb_set_bit(xb, bit);
+	xb_preload_end();
+
+	return ret;
+}
+
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit >= BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+			    bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+
+static void xbitmap_check_bit(unsigned long bit)
+{
+	assert(!xb_test_bit(&xb1, bit));
+	assert(!xb_preload_and_set_bit(&xb1, bit, GFP_KERNEL));
+	assert(xb_test_bit(&xb1, bit));
+	xb_clear_bit(&xb1, bit);
+	assert(xb_is_empty(&xb1));
+}
+
+static void xbitmap_check_bit_range(void)
+{
+	xb_preload(GFP_KERNEL);
+
+	/* Set a range of bits */
+	assert(!xb_set_bit(&xb1, 1060));
+	assert(!xb_set_bit(&xb1, 1061));
+	assert(!xb_set_bit(&xb1, 1064));
+	assert(!xb_set_bit(&xb1, 1065));
+	assert(!xb_set_bit(&xb1, 8180));
+	assert(!xb_set_bit(&xb1, 8181));
+	assert(!xb_set_bit(&xb1, 8190));
+	assert(!xb_set_bit(&xb1, 8191));
+
+	/* Test a range of bits */
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 1060);
+	assert(xb_find_next_zero_bit(&xb1, 1061, 10000) == 1062);
+	assert(xb_find_next_set_bit(&xb1, 1062, 10000) == 1064);
+	assert(xb_find_next_zero_bit(&xb1, 1065, 10000) == 1066);
+	assert(xb_find_next_set_bit(&xb1, 1066, 10000) == 8180);
+	assert(xb_find_next_zero_bit(&xb1, 8180, 10000) == 8182);
+	xb_clear_bit_range(&xb1, 0, 1000000);
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 10001);
+
+	assert(xb_find_next_zero_bit(&xb1, 20000, 30000) == 20000);
+
+	xb_preload_end();
+}
+
+void xbitmap_checks(void)
+{
+	xb_init(&xb1);
+
+	xbitmap_check_bit(0);
+	xbitmap_check_bit(30);
+	xbitmap_check_bit(31);
+	xbitmap_check_bit(1023);
+	xbitmap_check_bit(1024);
+	xbitmap_check_bit(1025);
+	xbitmap_check_bit((1UL << 63) | (1UL << 24));
+	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+	xbitmap_check_bit_range();
+}
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue
  2017-11-03  8:13 [PATCH v17 0/6] Virtio-balloon Enhancement Wei Wang
  2017-11-03  8:13 ` [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap Wei Wang
  2017-11-03  8:13 ` [PATCH v17 2/6] radix tree test suite: add tests for xbitmap Wei Wang
@ 2017-11-03  8:13 ` Wei Wang
  2017-11-03 10:59   ` Tetsuo Handa
  2017-11-03  8:13 ` [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-03  8:13 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu

From: "Michael S. Tsirkin" <mst@redhat.com>

fill_balloon() doing memory allocations under balloon_lock
can cause a deadlock when leak_balloon() is called from
virtballoon_oom_notify() and tries to take the same lock.

To fix this, split page allocation from enqueueing and do the
allocations outside the lock.

Here's a detailed analysis of the deadlock by Tetsuo Handa:

In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
serialize against fill_balloon(). But in fill_balloon(),
alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
called with the vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, even though
__GFP_NORETRY is specified, this allocation attempt might indirectly
depend on somebody else's __GFP_DIRECT_RECLAIM memory allocation. Such an
indirect __GFP_DIRECT_RECLAIM allocation might call leak_balloon() via
virtballoon_oom_notify() via the blocking_notifier_call_chain() callback
via out_of_memory() once it reaches __alloc_pages_may_oom() while holding
the oom_lock mutex. Since the vb->balloon_lock mutex is already held by
fill_balloon(), this causes an OOM lockup. Thus, do not wait for the
vb->balloon_lock mutex if leak_balloon() is called from out_of_memory().

Thread1                                Thread2
fill_balloon()
 takes a balloon_lock
  balloon_page_enqueue()
   alloc_page(GFP_HIGHUSER_MOVABLE)
    direct reclaim (__GFP_FS context)  takes a fs lock
     waits for that fs lock             alloc_page(GFP_NOFS)
                                         __alloc_pages_may_oom()
                                          takes the oom_lock
                                           out_of_memory()
                                            blocking_notifier_call_chain()
                                             leak_balloon()
                                               tries to take that
                                               balloon_lock and deadlocks
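
The essence of the fix is a lock-ordering discipline: anything that may
sleep or trigger reclaim (the allocations) happens with no lock held, and
the critical section only links pre-allocated items into the shared list.
The following is a hypothetical user-space C sketch of that pattern, not
the kernel code itself; the `node`, `alloc_batch` and `enqueue_batch`
names are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	int value;
	struct node *next;
};

static struct node *balloon_list;	/* protected by balloon_lock in the kernel */

/* Step 1: allocate everything with no lock held (allocation may sleep,
 * enter reclaim, or even trigger the OOM notifier -- that is now safe). */
static struct node *alloc_batch(int n)
{
	struct node *head = NULL;

	while (n-- > 0) {
		struct node *p = malloc(sizeof(*p));

		if (!p)
			break;		/* a partial batch is acceptable */
		p->value = n;
		p->next = head;
		head = p;
	}
	return head;
}

/* Step 2: with the lock held, only list manipulation happens --
 * nothing here can recurse into reclaim and re-take the lock. */
static int enqueue_batch(struct node *batch)
{
	int count = 0;

	/* mutex_lock(&balloon_lock); */
	while (batch) {
		struct node *next = batch->next;

		batch->next = balloon_list;
		balloon_list = batch;
		batch = next;
		count++;
	}
	/* mutex_unlock(&balloon_lock); */
	return count;
}
```

This mirrors the patch below: balloon_page_alloc() runs before
mutex_lock(&vb->balloon_lock), and balloon_page_enqueue() only performs
list insertion under the lock.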

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>

---
 drivers/virtio/virtio_balloon.c    | 23 ++++++++++++++++++-----
 include/linux/balloon_compaction.h | 34 +++++++++++++++++++++++++++++++++-
 mm/balloon_compaction.c            | 28 +++++++++++++++++++++-------
 3 files changed, 72 insertions(+), 13 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..45fe6a8 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -143,16 +143,17 @@ static void set_page_pfns(struct virtio_balloon *vb,
 
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
-	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	unsigned int num_pfns;
+	struct page *page;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
-	mutex_lock(&vb->balloon_lock);
-	for (vb->num_pfns = 0; vb->num_pfns < num;
-	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		struct page *page = balloon_page_enqueue(vb_dev_info);
+	for (num_pfns = 0; num_pfns < num;
+	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
+		struct page *page = balloon_page_alloc();
 
 		if (!page) {
 			dev_info_ratelimited(&vb->vdev->dev,
@@ -162,6 +163,18 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
+
+		balloon_page_push(&pages, page);
+	}
+
+	mutex_lock(&vb->balloon_lock);
+
+	vb->num_pfns = 0;
+	while ((page = balloon_page_pop(&pages))) {
+		balloon_page_enqueue(&vb->vb_dev_info, page);
+
+		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
+
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 79542b2..bdc055a 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -49,6 +49,7 @@
 #include <linux/gfp.h>
 #include <linux/err.h>
 #include <linux/fs.h>
+#include <linux/list.h>
 
 /*
  * Balloon device information descriptor.
@@ -66,7 +67,9 @@ struct balloon_dev_info {
 	struct inode *inode;
 };
 
-extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
+extern struct page *balloon_page_alloc(void);
+extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
+				 struct page *page);
 extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
 
 static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
@@ -86,6 +89,35 @@ extern void balloon_page_putback(struct page *page);
 extern int balloon_page_migrate(struct address_space *mapping,
 				struct page *newpage,
 				struct page *page, enum migrate_mode mode);
+/*
+ * balloon_page_push - insert a page into a page list.
+ * @head : pointer to list
+ * @page : page to be added
+ *
+ * Caller must ensure the page is private and protect the list.
+ */
+static inline void balloon_page_push(struct list_head *pages, struct page *page)
+{
+	list_add(&page->lru, pages);
+}
+
+/*
+ * balloon_page_pop - remove a page from a page list.
+ * @head : pointer to list
+ * @page : page to be added
+ *
+ * Caller must ensure the page is private and protect the list.
+ */
+static inline struct page *balloon_page_pop(struct list_head *pages)
+{
+	struct page *page = list_first_entry_or_null(pages, struct page, lru);
+
+	if (!page)
+		return NULL;
+
+	list_del(&page->lru);
+	return page;
+}
 
 /*
  * balloon_page_insert - insert a page into the balloon's page list and make
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 68d2892..16212c7 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -11,22 +11,36 @@
 #include <linux/balloon_compaction.h>
 
 /*
- * balloon_page_enqueue - allocates a new page and inserts it into the balloon
+ * balloon_page_alloc - allocates a new page for insertion into the balloon
  *			  page list.
+ *
+ * Driver must call it to properly allocate a new enlisted balloon page.
+ * Driver must call balloon_page_enqueue before definitively removing it from
+ * the guest system.  This function returns the page address for the recently
+ * allocated page or NULL if we fail to allocate a new page this turn.
+ */
+struct page *balloon_page_alloc(void)
+{
+	struct page *page = alloc_page(balloon_mapping_gfp_mask() |
+				       __GFP_NOMEMALLOC | __GFP_NORETRY);
+	return page;
+}
+EXPORT_SYMBOL_GPL(balloon_page_alloc);
+
+/*
+ * balloon_page_enqueue - inserts a new page into the balloon page list.
  * @b_dev_info: balloon device descriptor where we will insert a new page to
+ * @page: new page to enqueue - allocated using balloon_page_alloc.
  *
- * Driver must call it to properly allocate a new enlisted balloon page
+ * Driver must call it to properly enqueue a new enlisted balloon page
  * before definitively removing it from the guest system.
  * This function returns the page address for the recently enqueued page or
  * NULL in the case we fail to allocate a new page this turn.
  */
-struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info)
+void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
+			  struct page *page)
 {
 	unsigned long flags;
-	struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-				       __GFP_NOMEMALLOC | __GFP_NORETRY);
-	if (!page)
-		return NULL;
 
 	/*
 	 * Block others from accessing the 'page' when we get around to
-- 
2.7.4


* [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-03  8:13 [PATCH v17 0/6] Virtio-balloon Enhancement Wei Wang
                   ` (2 preceding siblings ...)
  2017-11-03  8:13 ` [PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue Wei Wang
@ 2017-11-03  8:13 ` Wei Wang
  2017-11-03 11:25   ` Tetsuo Handa
  2017-11-03  8:13 ` [PATCH v17 5/6] mm: support reporting free page blocks Wei Wang
  2017-11-03  8:13 ` [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
  5 siblings, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-03  8:13 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The previous virtio-balloon implementation is not very efficient,
because the balloon pages are transferred to the host one by one.
Here is the breakdown of the time, in percentages, spent on each
step of the balloon inflating process (inflating 7GB of an 8GB
idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are steps 2)
and 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of physically contiguous guest pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~492ms,
an improvement of ~88%.

TODO: optimize step 1) by allocating/freeing a chunk of pages
instead of a single page each time.
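
The chunking idea can be illustrated in isolation: given a sorted set of
page frame numbers, runs of contiguous pfns are coalesced into
[start, len] ranges so that one sg entry (and later one madvise() on the
host) covers a whole run instead of a single page. Below is a minimal
user-space C sketch of that coalescing step, under the assumption of a
plain sorted array rather than the xbitmap the patch actually uses;
`coalesce_pfns` and `struct range` are illustrative names, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct range {
	unsigned long start;	/* first pfn of the run */
	unsigned long len;	/* number of contiguous pfns */
};

/* Coalesce a sorted pfn array into contiguous ranges; returns the
 * number of ranges written to @out (caller sizes @out to at most n). */
static size_t coalesce_pfns(const unsigned long *pfns, size_t n,
			    struct range *out)
{
	size_t nranges = 0;
	size_t i = 0;

	while (i < n) {
		size_t j = i + 1;

		/* extend the run while pfns stay contiguous */
		while (j < n && pfns[j] == pfns[j - 1] + 1)
			j++;
		out[nranges].start = pfns[i];
		out[nranges].len = j - i;
		nranges++;
		i = j;
	}
	return nranges;
}
```

In the patch, tell_host_sgs() performs the equivalent search with
xb_find_next_set_bit()/xb_find_next_zero_bit() over the page xbitmap,
then splits any range larger than UINT_MAX bytes across several sgs.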

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/virtio/virtio_balloon.c     | 232 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 215 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 45fe6a8..b31fc25 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,130 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+	unsigned int len;
+
+	sg_init_one(&sg, addr, size);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &len))
+		;
+
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static void send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size,
+				 bool batch)
+{
+	int err;
+
+	err = add_one_sg(vq, addr, size);
+	/*
+	 * This is expected to never fail: there is always at least 1 entry
+	 * available on the vq, because when the vq is full the worker thread
+	 * that adds the sg will be put into sleep until at least 1 entry is
+	 * available to use.
+	 */
+	BUG_ON(err);
+
+	/* If batching is requested, we batch till the vq is full */
+	if (!batch || !vq->num_free)
+		kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for contiguous "1" bits, which correspond
+ * to contiguous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
+						    page_xb_end);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+						   sg_pfn_start + 1,
+						   page_xb_end);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
+					     true);
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
+static inline int xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+	int ret;
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+
+	do {
+		ret = xb_preload_and_set_bit(&vb->page_xb, pfn, GFP_KERNEL);
+	} while (unlikely(ret == -EAGAIN));
+
+	return ret;
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	unsigned num_allocated_pages;
 	unsigned int num_pfns;
 	struct page *page;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	for (num_pfns = 0; num_pfns < num;
 	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
@@ -164,6 +284,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			break;
 		}
 
+		if (use_sg && xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)
+			break;
 		balloon_page_push(&pages, page);
 	}
 
@@ -175,7 +297,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
 
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (!use_sg)
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -211,9 +338,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -223,7 +353,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg) {
+			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)
+				break;
+		} else {
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		}
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -234,13 +370,56 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
 
+/*
+ * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
+ * for xbitmap, which is not suitable for the oom case. This function does not
+ * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
+ * pages when VIRTIO_BALLOON_F_SG is negotiated.
+ */
+static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
+{
+	unsigned int n;
+	struct page *page;
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	struct virtqueue *vq = vb->deflate_vq;
+	LIST_HEAD(pages);
+
+	mutex_lock(&vb->balloon_lock);
+	for (n = 0; n < oom_pages; n++) {
+		page = balloon_page_dequeue(vb_dev_info);
+		if (!page)
+			break;
+
+		list_add(&page->lru, &pages);
+		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+		send_balloon_page_sg(vb, vq, page_address(page),
+				     VIRTIO_BALLOON_PAGES_PER_PAGE, true);
+		release_pages_balloon(vb, &pages);
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+	mutex_unlock(&vb->balloon_lock);
+
+	return n;
+}
+
+
 static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
@@ -380,7 +559,10 @@ static int virtballoon_oom_notify(struct notifier_block *self,
 		return NOTIFY_OK;
 
 	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
+		num_freed_pages = leak_balloon_sg_oom(vb);
+	else
+		num_freed_pages = leak_balloon(vb, oom_pages);
 	update_balloon_size(vb);
 	*freed += num_freed_pages;
 
@@ -454,6 +636,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -477,6 +660,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -498,16 +682,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -566,6 +758,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -682,6 +877,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4


* [PATCH v17 5/6] mm: support reporting free page blocks
  2017-11-03  8:13 [PATCH v17 0/6] Virtio-balloon Enhancement Wei Wang
                   ` (3 preceding siblings ...)
  2017-11-03  8:13 ` [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-11-03  8:13 ` Wei Wang
  2017-11-03  8:13 ` [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
  5 siblings, 0 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-03  8:13 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One example use of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.
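
The contract for the report callback is the key constraint here: it runs
under zone->lock, so it must be cheap, must not allocate, and returning
false stops the walk early. A hypothetical caller-side sketch in
user-space C (record_range() and the fixed-size table are illustrative,
not part of the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANGES 4

struct pfn_range {
	unsigned long pfn;
	unsigned long num;
};

/* Pre-sized table filled by the callback; no allocation happens
 * inside the callback itself, matching the walk's requirements. */
static struct pfn_range ranges[MAX_RANGES];
static unsigned int nr_ranges;

/* Signature matches the report_pfn_range callback taken by
 * walk_free_mem_block(): return false to stop the walk. */
static bool record_range(void *opaque, unsigned long pfn, unsigned long num)
{
	(void)opaque;
	if (nr_ranges >= MAX_RANGES)
		return false;	/* table full: ask the walker to stop */
	ranges[nr_ranges].pfn = pfn;
	ranges[nr_ranges].num = num;
	nr_ranges++;
	return true;		/* keep reporting */
}
```

A caller would then invoke walk_free_mem_block(NULL, 0, record_range)
and treat the recorded pfn ranges only as hints, since the pages may be
reallocated the moment zone->lock is dropped.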

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
---
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99d..fe5a90e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1877,6 +1877,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque,
+				int min_order,
+				bool (*report_pfn_range)(void *opaque,
+							 unsigned long pfn,
+							 unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c..2283fcc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4867,6 +4867,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+				struct zone *zone,
+				int order,
+				enum migratetype mt,
+				bool (*report_pfn_range)(void *,
+							 unsigned long,
+							 unsigned long))
+{
+	struct page *page;
+	struct list_head *list;
+	unsigned long pfn, flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list = &zone->free_area[order].free_list[mt];
+	list_for_each_entry(page, list, lru) {
+		pfn = page_to_pfn(page);
+		ret = report_pfn_range(opaque, pfn, 1 << order);
+		if (!ret)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general, low orders tend to be very volatile, so it makes more
+ * sense to query larger ones first for various optimizations such as
+ * ballooning. This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+			 int min_order,
+			 bool (*report_pfn_range)(void *opaque,
+						  unsigned long pfn,
+						  unsigned long num))
+{
+	struct zone *zone;
+	int order;
+	enum migratetype mt;
+	bool ret;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				ret = walk_free_page_list(opaque, zone,
+							  order, mt,
+							  report_pfn_range);
+				if (!ret)
+					return;
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-03  8:13 [PATCH v17 0/6] Virtio-balloon Enhancement Wei Wang
                   ` (4 preceding siblings ...)
  2017-11-03  8:13 ` [PATCH v17 5/6] mm: support reporting free page blocks Wei Wang
@ 2017-11-03  8:13 ` Wei Wang
  2017-11-13 10:34   ` Wei Wang
  2017-11-15 20:32   ` Michael S. Tsirkin
  5 siblings, 2 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-03  8:13 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, wei.w.wang, liliang.opensource,
	yang.zhang.wz, quan.xu

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates
support for reporting hints of guest free pages to the host via
virtio-balloon. The host requests the guest to report free pages by
sending commands via the virtio-balloon configuration registers.

When the guest starts to report, the first element added to the free page
vq is the sequence id of the start reporting command. The id is given by
the host, and it indicates which command the following free pages
correspond to. For example, the host may stop the report and start again
with a new command id. Obsolete pages from the previous start command
are detected on the host by the id mismatch. The id is added to the
vq using an output buffer, and the free pages are added to the vq using
input buffers.

Here are some explanations of the added configuration registers:
- host2guest_cmd: a register used by the host to send commands to the
guest.
- guest2host_cmd: written by the guest to acknowledge to the host the
commands that have been received. The host will clear the corresponding
bits in the host2guest_cmd register. The guest also uses this register
to send commands to the host (e.g. when it finishes free page reporting).
- free_page_cmd_id: the sequence id of the free page report command
given by the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 234 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |  11 ++
 2 files changed, 223 insertions(+), 22 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index b31fc25..4087f04 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The free page reporting work item submitted to the balloon wq */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +70,10 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+	uint32_t free_page_cmd_id;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -191,6 +200,30 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
 		kick_and_wait(vq, vb->acked);
 }
 
+static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	int err = 0;
+	unsigned int len;
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &len))
+		;
+
+	/*
+	 * Since this is an optimization feature, losing a couple of free
+	 * pages to report isn't important. We simply return without adding
+	 * the page if the vq is full.
+	 */
+	if (vq->num_free) {
+		err = add_one_sg(vq, addr, size);
+		BUG_ON(err);
+	}
+
+	/* Batch till the vq is full */
+	if (!vq->num_free)
+		virtqueue_kick(vq);
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -495,9 +528,8 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
+static void virtballoon_cmd_balloon_memory(struct virtio_balloon *vb)
 {
-	struct virtio_balloon *vb = vdev->priv;
 	unsigned long flags;
 
 	spin_lock_irqsave(&vb->stop_update_lock, flags);
@@ -506,6 +538,50 @@ static void virtballoon_changed(struct virtio_device *vdev)
 	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
 }
 
+static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb)
+{
+	unsigned long flags;
+
+	vb->report_free_page_stop = false;
+	spin_lock_irqsave(&vb->stop_update_lock, flags);
+	if (!vb->stop_update)
+		queue_work(vb->balloon_wq, &vb->report_free_page_work);
+	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+}
+
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+	u32 host2guest_cmd, guest2host_cmd = 0;
+
+	if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		virtballoon_cmd_balloon_memory(vb);
+		return;
+	}
+
+	virtio_cread(vb->vdev, struct virtio_balloon_config, host2guest_cmd,
+		     &host2guest_cmd);
+
+	if (host2guest_cmd & VIRTIO_BALLOON_CMD_BALLOON_MEMORY) {
+		virtballoon_cmd_balloon_memory(vb);
+		guest2host_cmd |= VIRTIO_BALLOON_CMD_BALLOON_MEMORY;
+	}
+
+	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START) {
+		virtballoon_cmd_report_free_page_start(vb);
+		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START;
+	}
+
+	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP) {
+		vb->report_free_page_stop = true;
+		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
+	}
+
+	/* Ack to the host about the commands that have been received */
+	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
+		      &guest2host_cmd);
+}
+
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	s64 target;
@@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return false;
+
+	send_free_page_sg(vb->free_page_vq, addr, len);
 
+	return true;
+}
+
+static void report_free_page_end(struct virtio_balloon *vb)
+{
+	u32 cmd = VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+	vb->report_free_page_stop = true;
+
+	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
+		      &cmd);
+}
+
+static void report_free_page_cmd_id(struct virtio_balloon *vb)
+{
+	struct scatterlist sg;
+	int err;
+
+	virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_cmd_id,
+		     &vb->free_page_cmd_id);
+	sg_init_one(&sg, &vb->free_page_cmd_id, sizeof(uint32_t));
+	err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1,
+				   &vb->free_page_cmd_id, GFP_KERNEL);
+	BUG_ON(err);
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	report_free_page_cmd_id(vb);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	/*
+	 * The last few free page blocks that were added may not reach the
+	 * batch size, but need a kick to notify the device to handle them.
+	 */
+	virtqueue_kick(vb->free_page_vq);
+	report_free_page_end(vb);
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		vb->free_page_vq = vqs[i];
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -761,6 +942,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -825,6 +1013,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -878,6 +1067,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..b758484 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,15 +35,26 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* VQ to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+#define	VIRTIO_BALLOON_CMD_BALLOON_MEMORY		(1 << 0)
+#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START	(1 << 1)
+#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP	(1 << 2)
+
 struct virtio_balloon_config {
 	/* Number of pages host wants Guest to give up. */
 	__u32 num_pages;
 	/* Number of pages we've actually got in balloon. */
 	__u32 actual;
+	/* Host-to-guest command, readonly by guest */
+	__u32 host2guest_cmd;
+	/* Sequence id of the free_page report command, readonly by guest */
+	__u32 free_page_cmd_id;
+	/* Guest-to-host command */
+	__u32 guest2host_cmd;
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap
  2017-11-03  8:13 ` [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap Wei Wang
@ 2017-11-03 10:55   ` Tetsuo Handa
  2017-11-06  8:15     ` Wei Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2017-11-03 10:55 UTC (permalink / raw)
  To: wei.w.wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

I'm commenting without understanding the logic.

Wei Wang wrote:
> +
> +bool xb_preload(gfp_t gfp);
> +

Want a __must_check annotation, since __radix_tree_preload() is marked
with __must_check. Erroneously failing to check the result of
xb_preload() will leave preemption disabled unexpectedly.



> +int xb_set_bit(struct xb *xb, unsigned long bit)
> +{
> +	int err;
> +	unsigned long index = bit / IDA_BITMAP_BITS;
> +	struct radix_tree_root *root = &xb->xbrt;
> +	struct radix_tree_node *node;
> +	void **slot;
> +	struct ida_bitmap *bitmap;
> +	unsigned long ebit;
> +
> +	bit %= IDA_BITMAP_BITS;
> +	ebit = bit + 2;
> +
> +	err = __radix_tree_create(root, index, 0, &node, &slot);
> +	if (err)
> +		return err;
> +	bitmap = rcu_dereference_raw(*slot);
> +	if (radix_tree_exception(bitmap)) {
> +		unsigned long tmp = (unsigned long)bitmap;
> +
> +		if (ebit < BITS_PER_LONG) {
> +			tmp |= 1UL << ebit;
> +			rcu_assign_pointer(*slot, (void *)tmp);
> +			return 0;
> +		}
> +		bitmap = this_cpu_xchg(ida_bitmap, NULL);
> +		if (!bitmap)

Please write locking rules, in order to explain how memory
allocated by __radix_tree_create() will not leak.

> +			return -EAGAIN;
> +		memset(bitmap, 0, sizeof(*bitmap));
> +		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
> +		rcu_assign_pointer(*slot, bitmap);
> +	}
> +
> +	if (!bitmap) {
> +		if (ebit < BITS_PER_LONG) {
> +			bitmap = (void *)((1UL << ebit) |
> +					RADIX_TREE_EXCEPTIONAL_ENTRY);
> +			__radix_tree_replace(root, node, slot, bitmap, NULL,
> +						NULL);
> +			return 0;
> +		}
> +		bitmap = this_cpu_xchg(ida_bitmap, NULL);
> +		if (!bitmap)

Same here.

> +			return -EAGAIN;
> +		memset(bitmap, 0, sizeof(*bitmap));
> +		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
> +	}
> +
> +	__set_bit(bit, bitmap->bitmap);
> +	return 0;
> +}



> +void xb_clear_bit(struct xb *xb, unsigned long bit)
> +{
> +	unsigned long index = bit / IDA_BITMAP_BITS;
> +	struct radix_tree_root *root = &xb->xbrt;
> +	struct radix_tree_node *node;
> +	void **slot;
> +	struct ida_bitmap *bitmap;
> +	unsigned long ebit;
> +
> +	bit %= IDA_BITMAP_BITS;
> +	ebit = bit + 2;
> +
> +	bitmap = __radix_tree_lookup(root, index, &node, &slot);
> +	if (radix_tree_exception(bitmap)) {
> +		unsigned long tmp = (unsigned long)bitmap;
> +
> +		if (ebit >= BITS_PER_LONG)
> +			return;
> +		tmp &= ~(1UL << ebit);
> +		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> +			__radix_tree_delete(root, node, slot);
> +		else
> +			rcu_assign_pointer(*slot, (void *)tmp);
> +		return;
> +	}
> +
> +	if (!bitmap)
> +		return;
> +
> +	__clear_bit(bit, bitmap->bitmap);
> +	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {

Please write locking rules, in order to explain how double kfree() and/or
use-after-free can be avoided.

> +		kfree(bitmap);
> +		__radix_tree_delete(root, node, slot);
> +	}
> +}



> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
> +{
> +	struct radix_tree_root *root = &xb->xbrt;
> +	struct radix_tree_node *node;
> +	void **slot;
> +	struct ida_bitmap *bitmap;
> +	unsigned int nbits;
> +
> +	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> +		unsigned long index = start / IDA_BITMAP_BITS;
> +		unsigned long bit = start % IDA_BITMAP_BITS;
> +
> +		bitmap = __radix_tree_lookup(root, index, &node, &slot);
> +		if (radix_tree_exception(bitmap)) {
> +			unsigned long ebit = bit + 2;
> +			unsigned long tmp = (unsigned long)bitmap;
> +
> +			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> +
> +			if (ebit >= BITS_PER_LONG)
> +				continue;
> +			bitmap_clear(&tmp, ebit, nbits);
> +			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> +				__radix_tree_delete(root, node, slot);
> +			else
> +				rcu_assign_pointer(*slot, (void *)tmp);
> +		} else if (bitmap) {
> +			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
> +
> +			if (nbits != IDA_BITMAP_BITS)
> +				bitmap_clear(bitmap->bitmap, bit, nbits);
> +
> +			if (nbits == IDA_BITMAP_BITS ||
> +				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {

Same here.

> +				kfree(bitmap);
> +				__radix_tree_delete(root, node, slot);
> +			}
> +		}
> +	}
> +}



> +bool xb_test_bit(struct xb *xb, unsigned long bit)
> +{
> +	unsigned long index = bit / IDA_BITMAP_BITS;
> +	const struct radix_tree_root *root = &xb->xbrt;
> +	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
> +
> +	bit %= IDA_BITMAP_BITS;
> +
> +	if (!bitmap)
> +		return false;
> +	if (radix_tree_exception(bitmap)) {
> +		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
> +		if (bit > BITS_PER_LONG)

Why not bit >= BITS_PER_LONG here?

> +			return false;
> +		return (unsigned long)bitmap & (1UL << bit);
> +	}
> +
> +	return test_bit(bit, bitmap->bitmap);
> +}


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue
  2017-11-03  8:13 ` [PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue Wei Wang
@ 2017-11-03 10:59   ` Tetsuo Handa
  0 siblings, 0 replies; 32+ messages in thread
From: Tetsuo Handa @ 2017-11-03 10:59 UTC (permalink / raw)
  To: wei.w.wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

Wei Wang wrote:
> Here's a detailed analysis of the deadlock by Tetsuo Handa:
> 
> In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
> serialize against fill_balloon(). But in fill_balloon(),
> alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
> called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
> implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY
> is specified, this allocation attempt might indirectly depend on somebody
> else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect
> __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via
> virtballoon_oom_notify() via blocking_notifier_call_chain() callback via
> out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock
> mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it
> will cause OOM lockup. Thus, do not wait for vb->balloon_lock mutex if
> leak_balloon() is called from out_of_memory().

Please drop "Thus, do not wait for vb->balloon_lock mutex if leak_balloon()
is called from out_of_memory()." part. This is not what this patch will do.

> 
> Thread1                                Thread2
> fill_balloon()
>  takes a balloon_lock
>   balloon_page_enqueue()
>    alloc_page(GFP_HIGHUSER_MOVABLE)
>     direct reclaim (__GFP_FS context)  takes a fs lock
>      waits for that fs lock             alloc_page(GFP_NOFS)
>                                          __alloc_pages_may_oom()
>                                           takes the oom_lock
>                                            out_of_memory()
>                                             blocking_notifier_call_chain()
>                                              leak_balloon()
>                                                tries to take that
> 					       balloon_lock and deadlocks


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-03  8:13 ` [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
@ 2017-11-03 11:25   ` Tetsuo Handa
  2017-11-04 11:09     ` Wei Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2017-11-03 11:25 UTC (permalink / raw)
  To: wei.w.wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

Wei Wang wrote:
> @@ -164,6 +284,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			break;
>  		}
>  
> +		if (use_sg && xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)

Isn't this leaking "page" ?

> +			break;
>  		balloon_page_push(&pages, page);
>  	}
>  



> @@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);

Please describe why tell_host_sgs() can work without __GFP_DIRECT_RECLAIM allocation,
for tell_host_sgs() is called with vb->balloon_lock mutex held.

> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;



> @@ -223,7 +353,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg) {
> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)

Isn't this leaking "page" ?

If this is inside vb->balloon_lock mutex (isn't this?), xb_set_page() must not
use __GFP_DIRECT_RECLAIM allocation, for leak_balloon_sg_oom() will be blocked
on vb->balloon_lock mutex.

> +				break;
> +		} else {
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		}
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-03 11:25   ` Tetsuo Handa
@ 2017-11-04 11:09     ` Wei Wang
  2017-11-04 11:28       ` Tetsuo Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-04 11:09 UTC (permalink / raw)
  To: Tetsuo Handa, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

On 11/03/2017 07:25 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> @@ -164,6 +284,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>>   			break;
>>   		}
>>   
>> +		if (use_sg && xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)
> Isn't this leaking "page" ?


Right, thanks, will add __free_page(page) here.

>> @@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>>   
>>   	num_allocated_pages = vb->num_pfns;
>>   	/* Did we get any? */
>> -	if (vb->num_pfns != 0)
>> -		tell_host(vb, vb->inflate_vq);
>> +	if (vb->num_pfns) {
>> +		if (use_sg)
>> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> Please describe why tell_host_sgs() can work without __GFP_DIRECT_RECLAIM allocation,
> for tell_host_sgs() is called with vb->balloon_lock mutex held.

Essentially,
tell_host_sgs()-->send_balloon_page_sg()-->add_one_sg()-->
virtqueue_add_inbuf(, , num=1, , GFP_KERNEL)
won't need any memory allocation, because we always add one sg
(i.e. num=1) each time. That memory allocation option is only used
when multiple sgs are added (i.e. num > 1) and the implementation
inside virtqueue_add_inbuf needs to allocate an indirect descriptor
table.

We could also add some comments above the function to explain a little 
about this if necessary.

>
>
>> @@ -223,7 +353,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>>   		page = balloon_page_dequeue(vb_dev_info);
>>   		if (!page)
>>   			break;
>> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>> +		if (use_sg) {
>> +			if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0)
> Isn't this leaking "page" ?

Yes, will make it:

     if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
         balloon_page_enqueue(..., page);
         break;
     }

>
> If this is inside vb->balloon_lock mutex (isn't this?), xb_set_page() must not
> use __GFP_DIRECT_RECLAIM allocation, for leak_balloon_sg_oom() will be blocked
> on vb->balloon_lock mutex.

OK. Since the preload() doesn't need too much memory (< 4K in total), 
how about GFP_NOWAIT here?


Best,
Wei


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-04 11:09     ` Wei Wang
@ 2017-11-04 11:28       ` Tetsuo Handa
  2017-11-06  8:21         ` Wei Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2017-11-04 11:28 UTC (permalink / raw)
  To: wei.w.wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

Wei Wang wrote:
> On 11/03/2017 07:25 PM, Tetsuo Handa wrote:
> >> @@ -184,8 +307,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> >>   
> >>   	num_allocated_pages = vb->num_pfns;
> >>   	/* Did we get any? */
> >> -	if (vb->num_pfns != 0)
> >> -		tell_host(vb, vb->inflate_vq);
> >> +	if (vb->num_pfns) {
> >> +		if (use_sg)
> >> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> > Please describe why tell_host_sgs() can work without __GFP_DIRECT_RECLAIM allocation,
> > for tell_host_sgs() is called with vb->balloon_lock mutex held.
> 
> Essentially,
> tell_host_sgs()-->send_balloon_page_sg()-->add_one_sg()-->
> virtqueue_add_inbuf(, , num=1, , GFP_KERNEL)
> won't need any memory allocation, because we always add one sg
> (i.e. num=1) each time. That memory allocation option is only used
> when multiple sgs are added (i.e. num > 1) and the implementation
> inside virtqueue_add_inbuf needs to allocate an indirect descriptor
> table.
> 
> We could also add some comments above the function to explain a little 
> about this if necessary.

Yes, please do so.

Or maybe replace GFP_KERNEL with GFP_NOWAIT or 0. Though Michael might remove that GFP
argument ( http://lkml.kernel.org/r/201710022344.JII17368.HQtLOMJOOSFFVF@I-love.SAKURA.ne.jp ).

> > If this is inside vb->balloon_lock mutex (isn't this?), xb_set_page() must not
> > use __GFP_DIRECT_RECLAIM allocation, for leak_balloon_sg_oom() will be blocked
> > on vb->balloon_lock mutex.
> 
> OK. Since the preload() doesn't need too much memory (< 4K in total), 
> how about GFP_NOWAIT here?

Maybe GFP_NOWAIT | __GFP_NOWARN ?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap
  2017-11-03 10:55   ` Tetsuo Handa
@ 2017-11-06  8:15     ` Wei Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-06  8:15 UTC (permalink / raw)
  To: Tetsuo Handa, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

On 11/03/2017 06:55 PM, Tetsuo Handa wrote:
> I'm commenting without understanding the logic.
>
> Wei Wang wrote:
>> +
>> +bool xb_preload(gfp_t gfp);
>> +
> Want __must_check annotation, for __radix_tree_preload() is marked
> with __must_check annotation. By error failing to check result of
> xb_preload() will lead to preemption kept disabled unexpectedly.
>

I don't disagree with this, but I find that its wrappers, e.g.
radix_tree_preload() and radix_tree_maybe_preload(), don't seem to have
__must_check added.


>
>> +int xb_set_bit(struct xb *xb, unsigned long bit)
>> +{
>> +	int err;
>> +	unsigned long index = bit / IDA_BITMAP_BITS;
>> +	struct radix_tree_root *root = &xb->xbrt;
>> +	struct radix_tree_node *node;
>> +	void **slot;
>> +	struct ida_bitmap *bitmap;
>> +	unsigned long ebit;
>> +
>> +	bit %= IDA_BITMAP_BITS;
>> +	ebit = bit + 2;
>> +
>> +	err = __radix_tree_create(root, index, 0, &node, &slot);
>> +	if (err)
>> +		return err;
>> +	bitmap = rcu_dereference_raw(*slot);
>> +	if (radix_tree_exception(bitmap)) {
>> +		unsigned long tmp = (unsigned long)bitmap;
>> +
>> +		if (ebit < BITS_PER_LONG) {
>> +			tmp |= 1UL << ebit;
>> +			rcu_assign_pointer(*slot, (void *)tmp);
>> +			return 0;
>> +		}
>> +		bitmap = this_cpu_xchg(ida_bitmap, NULL);
>> +		if (!bitmap)
> Please write locking rules, in order to explain how memory
> allocated by __radix_tree_create() will not leak.
>

For the memory allocated by __radix_tree_create(), I think we could add:

     if (!bitmap) {
         __radix_tree_delete(root, node, slot);
         break;
     }


For the locking rules, how about adding the following "Developer notes:" 
at the top of the file:

"
Locks are required to ensure that concurrent calls to xb_set_bit,
xb_preload_and_set_bit, xb_test_bit, xb_clear_bit, xb_clear_bit_range,
xb_find_next_set_bit and xb_find_next_zero_bit on the same ida bitmap
do not happen.
"

>> +bool xb_test_bit(struct xb *xb, unsigned long bit)
>> +{
>> +	unsigned long index = bit / IDA_BITMAP_BITS;
>> +	const struct radix_tree_root *root = &xb->xbrt;
>> +	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
>> +
>> +	bit %= IDA_BITMAP_BITS;
>> +
>> +	if (!bitmap)
>> +		return false;
>> +	if (radix_tree_exception(bitmap)) {
>> +		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
>> +		if (bit > BITS_PER_LONG)
> Why not bit >= BITS_PER_LONG here?

Yes, I think it should be ">=" here. Thanks.

Best,
Wei


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-11-04 11:28       ` Tetsuo Handa
@ 2017-11-06  8:21         ` Wei Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-06  8:21 UTC (permalink / raw)
  To: Tetsuo Handa, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

On 11/04/2017 07:28 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> On 11/03/2017 07:25 PM, Tetsuo Handa wrote:
>>
>> If this is inside vb->balloon_lock mutex (isn't this?), xb_set_page() must not
>> use __GFP_DIRECT_RECLAIM allocation, for leak_balloon_sg_oom() will be blocked
>> on vb->balloon_lock mutex.
>> OK. Since the preload() doesn't need too much memory (< 4K in total),
>> how about GFP_NOWAIT here?
> Maybe GFP_NOWAIT | __GFP_NOWARN ?

Sounds good to me. I also plan to move "xb_set_page()" under mutex_lock, 
that is,

     fill_balloon()
     {
         ...
         mutex_lock(&vb->balloon_lock);

         vb->num_pfns = 0;
         while ((page = balloon_page_pop(&pages))) {
==>        xb_set_page(..,page,..);
                 balloon_page_enqueue(&vb->vb_dev_info, page);
         ...
     }

As explained in the xbitmap patch, we need the lock to avoid concurrent 
access to the bitmap.

Best,
Wei


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 2/6] radix tree test suite: add tests for xbitmap
  2017-11-03  8:13 ` [PATCH v17 2/6] radix tree test suite: add tests for xbitmap Wei Wang
@ 2017-11-06 17:00   ` Matthew Wilcox
  2017-11-29 14:20     ` Wei Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2017-11-06 17:00 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, yang.zhang.wz, quan.xu

On Fri, Nov 03, 2017 at 04:13:02PM +0800, Wei Wang wrote:
> From: Matthew Wilcox <mawilcox@microsoft.com>
> 
> Add the following tests for xbitmap:
> 1) single bit test: single bit set/clear/find;
> 2) bit range test: set/clear a range of bits and find a 0 or 1 bit in
> the range.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  tools/include/linux/bitmap.h            |  34 ++++
>  tools/include/linux/kernel.h            |   2 +
>  tools/testing/radix-tree/Makefile       |   7 +-
>  tools/testing/radix-tree/linux/kernel.h |   2 -
>  tools/testing/radix-tree/main.c         |   5 +
>  tools/testing/radix-tree/test.h         |   1 +
>  tools/testing/radix-tree/xbitmap.c      | 278 ++++++++++++++++++++++++++++++++

Umm.  No.  You've duplicated xbitmap.c into the test framework, so now it can
slowly get out of sync with the one in lib/.  That's not OK.

Put it back the way it was, with the patch I gave you as patch 1/n
(relocating xbitmap.c from tools/testing/radix-tree to lib/).
Then add your enhancements as patch 2/n.  All you should need to
change in your 1/n from
http://git.infradead.org/users/willy/linux-dax.git/commit/727e401bee5ad7d37e0077291d90cc17475c6392
is a bit of Makefile tooling.  Leave the test suite embedded in the file;
that way people might remember to update the test suite when adding
new functionality.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-03  8:13 ` [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
@ 2017-11-13 10:34   ` Wei Wang
  2017-11-13 17:32     ` Michael S. Tsirkin
  2017-11-15 20:32   ` Michael S. Tsirkin
  1 sibling, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-13 10:34 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, penguin-kernel, cornelia.huck, mgorman, aarcange,
	amit.shah, pbonzini, willy, liliang.opensource, yang.zhang.wz,
	quan.xu

Ping for comments, thanks.

On 11/03/2017 04:13 PM, Wei Wang wrote:
> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
> support of reporting hints of guest free pages to the host via
> virtio-balloon. The host requests the guest to report the free pages by
> sending commands via the virtio-balloon configuration registers.
>
> When the guest starts to report, the first element added to the free page
> vq is a sequence id of the start reporting command. The id is given by
> the host, and it indicates whether the following free pages correspond
> to the command. For example, the host may stop the report and start again
> with a new command id. Obsolete pages from the previous start command
> can be detected on the host by the id mismatch. The id is added to the
> vq using an output buffer, and the free pages are added to the vq using
> input buffers.
>
> Here are some explanations about the added configuration registers:
> - host2guest_cmd: a register used by the host to send commands to the
> guest.
> - guest2host_cmd: written by the guest to ACK to the host about the
> commands that have been received. The host will clear the corresponding
> bits on the host2guest_cmd register. The guest also uses this register
> to send commands to the host (e.g. when it finishes free page reporting).
> - free_page_cmd_id: the sequence id of the free page report command
> given by the host.
>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>   drivers/virtio/virtio_balloon.c     | 234 ++++++++++++++++++++++++++++++++----
>   include/uapi/linux/virtio_balloon.h |  11 ++
>   2 files changed, 223 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b31fc25..4087f04 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
>   
>   struct virtio_balloon {
>   	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +
> +	/* Balloon's own wq for cpu-intensive work items */
> +	struct workqueue_struct *balloon_wq;
> +	/* The free page reporting work item submitted to the balloon wq */
> +	struct work_struct report_free_page_work;
>   
>   	/* The balloon servicing is delegated to a freezable workqueue. */
>   	struct work_struct update_balloon_stats_work;
> @@ -65,6 +70,10 @@ struct virtio_balloon {
>   	spinlock_t stop_update_lock;
>   	bool stop_update;
>   
> +	/* Stop reporting free pages */
> +	bool report_free_page_stop;
> +	uint32_t free_page_cmd_id;
> +
>   	/* Waiting for host to ack the pages we released. */
>   	wait_queue_head_t acked;
>   
> @@ -191,6 +200,30 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>   		kick_and_wait(vq, vb->acked);
>   }
>   
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	int err = 0;
> +	unsigned int len;
> +
> +	/* Detach all the used buffers from the vq */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +
> +	/*
> +	 * Since this is an optimization feature, losing a couple of free
> +	 * pages to report isn't important. We simply return without adding
> +	 * the page if the vq is full.
> +	 */
> +	if (vq->num_free) {
> +		err = add_one_sg(vq, addr, size);
> +		BUG_ON(err);
> +	}
> +
> +	/* Batch till the vq is full */
> +	if (!vq->num_free)
> +		virtqueue_kick(vq);
> +}
> +
>   /*
>    * Send balloon pages in sgs to host. The balloon pages are recorded in the
>    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -495,9 +528,8 @@ static void stats_handle_request(struct virtio_balloon *vb)
>   	virtqueue_kick(vq);
>   }
>   
> -static void virtballoon_changed(struct virtio_device *vdev)
> +static void virtballoon_cmd_balloon_memory(struct virtio_balloon *vb)
>   {
> -	struct virtio_balloon *vb = vdev->priv;
>   	unsigned long flags;
>   
>   	spin_lock_irqsave(&vb->stop_update_lock, flags);
> @@ -506,6 +538,50 @@ static void virtballoon_changed(struct virtio_device *vdev)
>   	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
>   }
>   
> +static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb)
> +{
> +	unsigned long flags;
> +
> +	vb->report_free_page_stop = false;
> +	spin_lock_irqsave(&vb->stop_update_lock, flags);
> +	if (!vb->stop_update)
> +		queue_work(vb->balloon_wq, &vb->report_free_page_work);
> +	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> +}
> +
> +static void virtballoon_changed(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = vdev->priv;
> +	u32 host2guest_cmd, guest2host_cmd = 0;
> +
> +	if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		virtballoon_cmd_balloon_memory(vb);
> +		return;
> +	}
> +
> +	virtio_cread(vb->vdev, struct virtio_balloon_config, host2guest_cmd,
> +		     &host2guest_cmd);
> +
> +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_BALLOON_MEMORY) {
> +		virtballoon_cmd_balloon_memory(vb);
> +		guest2host_cmd |= VIRTIO_BALLOON_CMD_BALLOON_MEMORY;
> +	}
> +
> +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START) {
> +		virtballoon_cmd_report_free_page_start(vb);
> +		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START;
> +	}
> +
> +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP) {
> +		vb->report_free_page_stop = true;
> +		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
> +	}
> +
> +	/* Ack to the host about the commands that have been received */
> +	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
> +		      &guest2host_cmd);
> +}
> +
>   static inline s64 towards_target(struct virtio_balloon *vb)
>   {
>   	s64 target;
> @@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work)
>   		queue_work(system_freezable_wq, work);
>   }
>   
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
>   {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	if (vb->report_free_page_stop)
> +		return false;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
>   
> +	return true;
> +}
> +
> +static void report_free_page_end(struct virtio_balloon *vb)
> +{
> +	u32 cmd = VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
>   	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * The host may have already requested to stop the reporting before we
> +	 * finish, so no need to notify the host in this case.
>   	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (vb->report_free_page_stop)
> +		return;
> +	vb->report_free_page_stop = true;
> +
> +	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
> +		      &cmd);
> +}
> +
> +static void report_free_page_cmd_id(struct virtio_balloon *vb)
> +{
> +	struct scatterlist sg;
> +	int err;
> +
> +	virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_cmd_id,
> +		     &vb->free_page_cmd_id);
> +	sg_init_one(&sg, &vb->free_page_cmd_id, sizeof(uint32_t));
> +	err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1,
> +				   &vb->free_page_cmd_id, GFP_KERNEL);
> +	BUG_ON(err);
> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	report_free_page_cmd_id(vb);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	/*
> +	 * The last few free page blocks that were added may not reach the
> +	 * batch size, but need a kick to notify the device to handle them.
> +	 */
> +	virtqueue_kick(vb->free_page_vq);
> +	report_free_page_end(vb);
> +}
> +
> +static int init_vqs(struct virtio_balloon *vb)
> +{
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = NULL;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>   	if (err)
> -		return err;
> +		goto err_find;
>   
>   	vb->inflate_vq = vqs[0];
>   	vb->deflate_vq = vqs[1];
> +	i = 2;
>   	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>   		/*
>   		 * Prime this virtqueue with one buffer so the hypervisor can
>   		 * use it to signal us later (it can't be broken yet!).
>   		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>   		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>   		virtqueue_kick(vb->stats_vq);
>   	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		vb->free_page_vq = vqs[i];
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>   	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>   }
>   
>   #ifdef CONFIG_BALLOON_COMPACTION
> @@ -761,6 +942,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
>   	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>   		xb_init(&vb->page_xb);
>   
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->balloon_wq = alloc_workqueue("balloon-wq",
> +					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +		vb->report_free_page_stop = true;
> +	}
> +
>   	vb->nb.notifier_call = virtballoon_oom_notify;
>   	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>   	err = register_oom_notifier(&vb->nb);
> @@ -825,6 +1013,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>   	spin_unlock_irq(&vb->stop_update_lock);
>   	cancel_work_sync(&vb->update_balloon_size_work);
>   	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>   
>   	remove_common(vb);
>   #ifdef CONFIG_BALLOON_COMPACTION
> @@ -878,6 +1067,7 @@ static unsigned int features[] = {
>   	VIRTIO_BALLOON_F_STATS_VQ,
>   	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>   	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>   };
>   
>   static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..b758484 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,15 +35,26 @@
>   #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>   #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>   #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* VQ to report free pages */
>   
>   /* Size of a PFN in the balloon interface. */
>   #define VIRTIO_BALLOON_PFN_SHIFT 12
>   
> +#define	VIRTIO_BALLOON_CMD_BALLOON_MEMORY		(1 << 0)
> +#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START	(1 << 1)
> +#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP	(1 << 2)
> +
>   struct virtio_balloon_config {
>   	/* Number of pages host wants Guest to give up. */
>   	__u32 num_pages;
>   	/* Number of pages we've actually got in balloon. */
>   	__u32 actual;
> +	/* Host-to-guest command, readonly by guest */
> +	__u32 host2guest_cmd;
> +	/* Sequence id of the free_page report command, readonly by guest */
> +	__u32 free_page_cmd_id;
> +	/* Guest-to-host command */
> +	__u32 guest2host_cmd;
>   };
>   
>   #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-13 10:34   ` Wei Wang
@ 2017-11-13 17:32     ` Michael S. Tsirkin
  2017-11-14 12:02       ` Wei Wang
  2017-11-20 11:42       ` Wei Wang
  0 siblings, 2 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2017-11-13 17:32 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, Nitesh Narayan Lal,
	Rik van Riel

You should Cc Nitesh, who is working on a related feature.

On Mon, Nov 13, 2017 at 06:34:48PM +0800, Wei Wang wrote:
> Ping for comments, thanks.
> 
> On 11/03/2017 04:13 PM, Wei Wang wrote:
> > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
> > support of reporting hints of guest free pages to the host via
> > virtio-balloon. The host requests the guest to report the free pages by
> > sending commands via the virtio-balloon configuration registers.
> > 
> > When the guest starts to report, the first element added to the free page
> > vq is a sequence id of the start reporting command. The id is given by
> > the host, and it indicates whether the following free pages correspond
> > to the command. For example, the host may stop the report and start again
> > with a new command id. Obsolete pages from the previous start command
> > can be detected on the host by the id mismatch. The id is added to the
> > vq using an output buffer, and the free pages are added to the vq using
> > input buffers.
> > 
> > Here are some explanations about the added configuration registers:
> > - host2guest_cmd: a register used by the host to send commands to the
> > guest.
> > - guest2host_cmd: written by the guest to ACK to the host about the
> > commands that have been received. The host will clear the corresponding
> > bits on the host2guest_cmd register. The guest also uses this register
> > to send commands to the host (e.g. when it finishes free page reporting).

I am not sure what the role of guest2host_cmd is. Reporting the
correct cmd id seems a sufficient indication that the guest received
the start command. Not getting any more pages seems sufficient to
detect a stop.


> > - free_page_cmd_id: the sequence id of the free page report command
> > given by the host.
> > 
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > ---
> >   drivers/virtio/virtio_balloon.c     | 234 ++++++++++++++++++++++++++++++++----
> >   include/uapi/linux/virtio_balloon.h |  11 ++
> >   2 files changed, 223 insertions(+), 22 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index b31fc25..4087f04 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
> >   struct virtio_balloon {
> >   	struct virtio_device *vdev;
> > -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > +
> > +	/* Balloon's own wq for cpu-intensive work items */
> > +	struct workqueue_struct *balloon_wq;
> > +	/* The free page reporting work item submitted to the balloon wq */
> > +	struct work_struct report_free_page_work;
> >   	/* The balloon servicing is delegated to a freezable workqueue. */
> >   	struct work_struct update_balloon_stats_work;
> > @@ -65,6 +70,10 @@ struct virtio_balloon {
> >   	spinlock_t stop_update_lock;
> >   	bool stop_update;
> > +	/* Stop reporting free pages */
> > +	bool report_free_page_stop;

I would revert logic here: bool report_free_page;

> > +	uint32_t free_page_cmd_id;
> > +
> >   	/* Waiting for host to ack the pages we released. */
> >   	wait_queue_head_t acked;
> > @@ -191,6 +200,30 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
> >   		kick_and_wait(vq, vb->acked);
> >   }
> > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > +{
> > +	int err = 0;
> > +	unsigned int len;
> > +
> > +	/* Detach all the used buffers from the vq */
> > +	while (virtqueue_get_buf(vq, &len))
> > +		;
> > +
> > +	/*
> > +	 * Since this is an optimization feature, losing a couple of free
> > +	 * pages to report isn't important. We simply return without adding
> > +	 * the page if the vq is full.
> > +	 */
> > +	if (vq->num_free) {
> > +		err = add_one_sg(vq, addr, size);
> > +		BUG_ON(err);
> > +	}
> > +
> > +	/* Batch till the vq is full */
> > +	if (!vq->num_free)
> > +		virtqueue_kick(vq);
> > +}
> > +
> >   /*
> >    * Send balloon pages in sgs to host. The balloon pages are recorded in the
> >    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > @@ -495,9 +528,8 @@ static void stats_handle_request(struct virtio_balloon *vb)
> >   	virtqueue_kick(vq);
> >   }
> > -static void virtballoon_changed(struct virtio_device *vdev)
> > +static void virtballoon_cmd_balloon_memory(struct virtio_balloon *vb)
> >   {
> > -	struct virtio_balloon *vb = vdev->priv;
> >   	unsigned long flags;
> >   	spin_lock_irqsave(&vb->stop_update_lock, flags);
> > @@ -506,6 +538,50 @@ static void virtballoon_changed(struct virtio_device *vdev)
> >   	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> >   }
> > +static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb)
> > +{
> > +	unsigned long flags;
> > +
> > +	vb->report_free_page_stop = false;

This flag is used a lot outside any locks. Why is that safe?
Please add some comments explaining how access to this flag is
synchronized.

> > +	spin_lock_irqsave(&vb->stop_update_lock, flags);
> > +	if (!vb->stop_update)
> > +		queue_work(vb->balloon_wq, &vb->report_free_page_work);
> > +	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> > +}
> > +
> > +static void virtballoon_changed(struct virtio_device *vdev)
> > +{
> > +	struct virtio_balloon *vb = vdev->priv;
> > +	u32 host2guest_cmd, guest2host_cmd = 0;
> > +
> > +	if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> > +		virtballoon_cmd_balloon_memory(vb);
> > +		return;

This might be a handy feature: one can disable balloon without
hot-unplug. But I would use a separate feature flag to
control it.

> > +	}
> > +
> > +	virtio_cread(vb->vdev, struct virtio_balloon_config, host2guest_cmd,
> > +		     &host2guest_cmd);
> > +
> > +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_BALLOON_MEMORY) {
> > +		virtballoon_cmd_balloon_memory(vb);
> > +		guest2host_cmd |= VIRTIO_BALLOON_CMD_BALLOON_MEMORY;
> > +	}
> > +
> > +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START) {
> > +		virtballoon_cmd_report_free_page_start(vb);
> > +		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START;
> > +	}
> > +
> > +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP) {
> > +		vb->report_free_page_stop = true;
> > +		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
> > +	}

I am not sure why free page reporting has start+stop commands while
e.g. ballooning has a single bit. In fact, I would just use the
command id: when it changes, we know a new report is needed.

> > +
> > +	/* Ack to the host about the commands that have been received */
> > +	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
> > +		      &guest2host_cmd);

So the same register is used both to ack the stop command and to signal
the end of a report. This seems buggy.

> > +}
> > +
> >   static inline s64 towards_target(struct virtio_balloon *vb)
> >   {
> >   	s64 target;
> > @@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work)
> >   		queue_work(system_freezable_wq, work);
> >   }
> > -static int init_vqs(struct virtio_balloon *vb)
> > +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> > +					   unsigned long nr_pages)
> >   {
> > -	struct virtqueue *vqs[3];
> > -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> > -	static const char * const names[] = { "inflate", "deflate", "stats" };
> > -	int err, nvqs;
> > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > +	void *addr = (void *)pfn_to_kaddr(pfn);

How do we know all free pages have a kaddr?

> > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > +
> > +	if (vb->report_free_page_stop)
> > +		return false;
> > +
> > +	send_free_page_sg(vb->free_page_vq, addr, len);
> > +	return true;
> > +}
> > +
> > +static void report_free_page_end(struct virtio_balloon *vb)
> > +{
> > +	u32 cmd = VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
> >   	/*
> > -	 * We expect two virtqueues: inflate and deflate, and
> > -	 * optionally stat.
> > +	 * The host may have already requested to stop the reporting before we
> > +	 * finish, so no need to notify the host in this case.
> >   	 */
> > -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> > +	if (vb->report_free_page_stop)
> > +		return;
> > +	vb->report_free_page_stop = true;
> > +
> > +	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
> > +		      &cmd);

Wouldn't it be easier to add a buffer in the queue?

> > +}
> > +
> > +static void report_free_page_cmd_id(struct virtio_balloon *vb)
> > +{
> > +	struct scatterlist sg;
> > +	int err;
> > +
> > +	virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_cmd_id,
> > +		     &vb->free_page_cmd_id);
> > +	sg_init_one(&sg, &vb->free_page_cmd_id, sizeof(uint32_t));
> > +	err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1,
> > +				   &vb->free_page_cmd_id, GFP_KERNEL);
> > +	BUG_ON(err);
> > +}
> > +
> > +static void report_free_page(struct work_struct *work)
> > +{
> > +	struct virtio_balloon *vb;
> > +
> > +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> > +	report_free_page_cmd_id(vb);
> > +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > +	/*
> > +	 * The last few free page blocks that were added may not reach the
> > +	 * batch size, but need a kick to notify the device to handle them.
> > +	 */
> > +	virtqueue_kick(vb->free_page_vq);
> > +	report_free_page_end(vb);
> > +}
> > +
> > +static int init_vqs(struct virtio_balloon *vb)
> > +{
> > +	struct virtqueue **vqs;
> > +	vq_callback_t **callbacks;
> > +	const char **names;
> > +	struct scatterlist sg;
> > +	int i, nvqs, err = -ENOMEM;
> > +
> > +	/* Inflateq and deflateq are used unconditionally */
> > +	nvqs = 2;
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> > +		nvqs++;
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> > +		nvqs++;
> > +
> > +	/* Allocate space for find_vqs parameters */
> > +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> > +	if (!vqs)
> > +		goto err_vq;
> > +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> > +	if (!callbacks)
> > +		goto err_callback;
> > +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> > +	if (!names)
> > +		goto err_names;
> > +
> > +	callbacks[0] = balloon_ack;
> > +	names[0] = "inflate";
> > +	callbacks[1] = balloon_ack;
> > +	names[1] = "deflate";
> > +
> > +	i = 2;
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > +		callbacks[i] = stats_request;
> > +		names[i] = "stats";
> > +		i++;
> > +	}
> > +
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> > +		callbacks[i] = NULL;
> > +		names[i] = "free_page_vq";
> > +	}
> > +
> > +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> > +					 NULL, NULL);
> >   	if (err)
> > -		return err;
> > +		goto err_find;
> >   	vb->inflate_vq = vqs[0];
> >   	vb->deflate_vq = vqs[1];
> > +	i = 2;
> >   	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > -		struct scatterlist sg;
> > -		unsigned int num_stats;
> > -		vb->stats_vq = vqs[2];
> > -
> > +		vb->stats_vq = vqs[i++];
> >   		/*
> >   		 * Prime this virtqueue with one buffer so the hypervisor can
> >   		 * use it to signal us later (it can't be broken yet!).
> >   		 */
> > -		num_stats = update_balloon_stats(vb);
> > -
> > -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> > +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
> >   		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> > -		    < 0)
> > -			BUG();
> > +		    < 0) {
> > +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> > +				 __func__);
> > +			goto err_find;
> > +		}
> >   		virtqueue_kick(vb->stats_vq);
> >   	}
> > +
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> > +		vb->free_page_vq = vqs[i];
> > +
> > +	kfree(names);
> > +	kfree(callbacks);
> > +	kfree(vqs);
> >   	return 0;
> > +
> > +err_find:
> > +	kfree(names);
> > +err_names:
> > +	kfree(callbacks);
> > +err_callback:
> > +	kfree(vqs);
> > +err_vq:
> > +	return err;
> >   }
> >   #ifdef CONFIG_BALLOON_COMPACTION
> > @@ -761,6 +942,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
> >   	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> >   		xb_init(&vb->page_xb);
> > +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> > +		vb->balloon_wq = alloc_workqueue("balloon-wq",
> > +					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> > +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> > +		vb->report_free_page_stop = true;
> > +	}
> > +
> >   	vb->nb.notifier_call = virtballoon_oom_notify;
> >   	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
> >   	err = register_oom_notifier(&vb->nb);
> > @@ -825,6 +1013,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
> >   	spin_unlock_irq(&vb->stop_update_lock);
> >   	cancel_work_sync(&vb->update_balloon_size_work);
> >   	cancel_work_sync(&vb->update_balloon_stats_work);
> > +	cancel_work_sync(&vb->report_free_page_work);
> >   	remove_common(vb);
> >   #ifdef CONFIG_BALLOON_COMPACTION
> > @@ -878,6 +1067,7 @@ static unsigned int features[] = {
> >   	VIRTIO_BALLOON_F_STATS_VQ,
> >   	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> >   	VIRTIO_BALLOON_F_SG,
> > +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
> >   };
> >   static struct virtio_driver virtio_balloon_driver = {
> > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > index 37780a7..b758484 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -35,15 +35,26 @@
> >   #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
> >   #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> >   #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> > +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* VQ to report free pages */
> >   /* Size of a PFN in the balloon interface. */
> >   #define VIRTIO_BALLOON_PFN_SHIFT 12
> > +#define	VIRTIO_BALLOON_CMD_BALLOON_MEMORY		(1 << 0)
> > +#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START	(1 << 1)
> > +#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP	(1 << 2)
> > +
> >   struct virtio_balloon_config {
> >   	/* Number of pages host wants Guest to give up. */
> >   	__u32 num_pages;
> >   	/* Number of pages we've actually got in balloon. */
> >   	__u32 actual;
> > +	/* Host-to-guest command, readonly by guest */
> > +	__u32 host2guest_cmd;
> > +	/* Sequence id of the free_page report command, readonly by guest */
> > +	__u32 free_page_cmd_id;
> > +	/* Guest-to-host command */
> > +	__u32 guest2host_cmd;
> >   };
> >   #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-13 17:32     ` Michael S. Tsirkin
@ 2017-11-14 12:02       ` Wei Wang
  2017-11-14 21:21         ` Michael S. Tsirkin
  2017-11-20 11:42       ` Wei Wang
  1 sibling, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-14 12:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, Nitesh Narayan Lal,
	Rik van Riel

On 11/14/2017 01:32 AM, Michael S. Tsirkin wrote:
>> - guest2host_cmd: written by the guest to ACK to the host about the
>> commands that have been received. The host will clear the corresponding
>> bits on the host2guest_cmd register. The guest also uses this register
>> to send commands to the host (e.g. when finish free page reporting).
> I am not sure what is the role of guest2host_cmd. Reporting of
> the correct cmd id seems sufficient indication that guest
> received the start command. Not getting any more seems sufficient
> to detect stop.
>

I think the issue is when the host is waiting for the guest to report 
pages, it does not know whether the guest is going to report more or the 
report is done already. That's why we need a way to let the guest tell 
the host "the report is done, don't wait for more", then the host 
continues to the next step - sending the non-free pages to the 
destination. The following method consolidates the other comments, with 
some new thoughts. Please check whether it looks good.

Two new configuration registers in total:
- cmd_reg: the command register, combined from the previous host2guest 
and guest2host. I think we can use the same register for host requesting 
and guest ACKing, since the guest writing will trap to QEMU, that is, 
all the writes to the register are performed in QEMU, and we can keep 
things work in a correct way there.
- cmd_id_reg: the sequence id of the free page report command.

-- free page report:
     - host requests the guest to start reporting by "cmd_reg | 
REPORT_START";
     - guest ACKs to the host about receiving the start reporting 
request by "cmd_reg | REPORT_START", host will clear the flag bit once 
receiving the ACK.
     - host requests the guest to stop reporting by "cmd_reg | REPORT_STOP";
     - guest ACKs to the host about receiving the stop reporting request 
by "cmd_reg | REPORT_STOP", host will clear the flag once receiving the ACK.
     - guest tells the host about the start of the reporting by writing 
"cmd id" into an outbuf, which is added to the free page vq.
     - guest tells the host about the end of the reporting by writing 
"0" into an outbuf, which is added to the free page vq. (we reserve 
"id=0" as the stop sign)

-- ballooning:
     - host requests the guest to start ballooning by "cmd_reg | 
BALLOONING";
     - guest ACKs to the host about receiving the request by "cmd_reg | 
BALLOONING", host will clear the flag once receiving the ACK.


Some more explanations:
-- Why not let the host request the guest to start the free page 
reporting simply by writing a new cmd id to the cmd_id_reg?
The configuration interrupt is shared among all the features - 
ballooning, free page reporting, and future feature extensions which 
need host-to-guest requests. Some features may need to add other feature 
specific configuration registers, like free page reporting need the 
cmd_id_reg, which is not used by ballooning. The rule here is that the 
feature specific registers are read only when that feature is requested 
via the cmd_reg. For example, the cmd_id_reg is read only when "cmd_reg 
| REPORT_START" is true. Otherwise, when the driver receives a 
configuration interrupt, it has to read both cmd_reg and cmd_id 
registers to know what are requested by the host - think about the case 
that ballooning requests are sent frequently while free page reporting 
isn't requested, the guest has to read the cmd_id register every time a 
ballooning request is sent by the host, which is not necessary. If 
future new features follow this style, there will be more unnecessary 
VMexits to read the unused feature specific registers.
So I think it is good to have a central control of the feature request 
via only one cmd register - reading that one is enough to know what is 
requested by the host.


Best,
Wei



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-14 12:02       ` Wei Wang
@ 2017-11-14 21:21         ` Michael S. Tsirkin
  2017-11-15  3:47           ` Wei Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2017-11-14 21:21 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, Nitesh Narayan Lal,
	Rik van Riel

On Tue, Nov 14, 2017 at 08:02:03PM +0800, Wei Wang wrote:
> On 11/14/2017 01:32 AM, Michael S. Tsirkin wrote:
> > > - guest2host_cmd: written by the guest to ACK to the host about the
> > > commands that have been received. The host will clear the corresponding
> > > bits on the host2guest_cmd register. The guest also uses this register
> > > to send commands to the host (e.g. when finish free page reporting).
> > I am not sure what is the role of guest2host_cmd. Reporting of
> > the correct cmd id seems sufficient indication that guest
> > received the start command. Not getting any more seems sufficient
> > to detect stop.
> > 
> 
> I think the issue is when the host is waiting for the guest to report pages,
> it does not know whether the guest is going to report more or the report is
> done already. That's why we need a way to let the guest tell the host "the
> report is done, don't wait for more", then the host continues to the next
> step - sending the non-free pages to the destination. The following method
> is a conclusion of other comments, with some new thought. Please have a
> check if it is good.

config won't work well for this IMHO.
Writes to config register are hard to synchronize with the VQ.
For example, guest sends free pages, host says stop, meanwhile
guest sends stop for 1st set of pages.

How about adding a buffer with "stop" in the VQ instead?
Wastes a VQ entry which you will need to reserve for this
but is it a big deal?


> Two new configuration registers in total:
> - cmd_reg: the command register, combined from the previous host2guest and
> guest2host. I think we can use the same register for host requesting and
> guest ACKing, since the guest writing will trap to QEMU, that is, all the
> writes to the register are performed in QEMU, and we can keep things work in
> a correct way there.
> - cmd_id_reg: the sequence id of the free page report command.
> 
> -- free page report:
>     - host requests the guest to start reporting by "cmd_reg |
> REPORT_START";
>     - guest ACKs to the host about receiving the start reporting request by
> "cmd_reg | REPORT_START", host will clear the flag bit once receiving the
> ACK.
>     - host requests the guest to stop reporting by "cmd_reg | REPORT_STOP";
>     - guest ACKs to the host about receiving the stop reporting request by
> "cmd_reg | REPORT_STOP", host will clear the flag once receiving the ACK.
>     - guest tells the host about the start of the reporting by writing "cmd
> id" into an outbuf, which is added to the free page vq.
>     - guest tells the host about the end of the reporting by writing "0"
> into an outbuf, which is added to the free page vq. (we reserve "id=0" as
> the stop sign)
> 
> -- ballooning:
>     - host requests the guest to start ballooning by "cmd_reg | BALLOONING";
>     - guest ACKs to the host about receiving the request by "cmd_reg |
> BALLOONING", host will clear the flag once receiving the ACK.
> 
> 
> Some more explanations:
> -- Why not let the host request the guest to start the free page reporting
> simply by writing a new cmd id to the cmd_id_reg?
> The configuration interrupt is shared among all the features - ballooning,
> free page reporting, and future feature extensions which need host-to-guest
> requests. Some features may need to add other feature specific configuration
> registers, like free page reporting need the cmd_id_reg, which is not used
> by ballooning. The rule here is that the feature specific registers are read
> only when that feature is requested via the cmd_reg. For example, the
> cmd_id_reg is read only when "cmd_reg | REPORT_START" is true. Otherwise,
> when the driver receives a configuration interrupt, it has to read both
> cmd_reg and cmd_id registers to know what are requested by the host - think
> about the case that ballooning requests are sent frequently while free page
> reporting isn't requested, the guest has to read the cmd_id register every
> time a ballooning request is sent by the host, which is not necessary. If
> future new features follow this style, there will be more unnecessary
> VMexits to read the unused feature specific registers.
> So I think it is good to have a central control of the feature request via
> only one cmd register - reading that one is enough to know what is requested
> by the host.
> 

Right now you are increasing the cost of balloon request 3x though.


How about we establish a baseline with a simple interface, and
then add the command register when it's actually beneficial.



> Best,
> Wei


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-14 21:21         ` Michael S. Tsirkin
@ 2017-11-15  3:47           ` Wei Wang
  2017-11-15 13:26             ` Michael S. Tsirkin
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Wang @ 2017-11-15  3:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, Nitesh Narayan Lal,
	Rik van Riel

On 11/15/2017 05:21 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 14, 2017 at 08:02:03PM +0800, Wei Wang wrote:
>> On 11/14/2017 01:32 AM, Michael S. Tsirkin wrote:
>>>> - guest2host_cmd: written by the guest to ACK to the host about the
>>>> commands that have been received. The host will clear the corresponding
>>>> bits on the host2guest_cmd register. The guest also uses this register
>>>> to send commands to the host (e.g. when finish free page reporting).
>>> I am not sure what is the role of guest2host_cmd. Reporting of
>>> the correct cmd id seems sufficient indication that guest
>>> received the start command. Not getting any more seems sufficient
>>> to detect stop.
>>>
>> I think the issue is when the host is waiting for the guest to report pages,
>> it does not know whether the guest is going to report more or the report is
>> done already. That's why we need a way to let the guest tell the host "the
>> report is done, don't wait for more", then the host continues to the next
>> step - sending the non-free pages to the destination. The following method
>> is a conclusion of other comments, with some new thought. Please have a
>> check if it is good.
> config won't work well for this IMHO.
> Writes to config register are hard to synchronize with the VQ.
> For example, guest sends free pages, host says stop, meanwhile
> guest sends stop for 1st set of pages.

I still don't see an issue with this. Please see below:
(before jumping into the discussion, just make sure I've well explained 
this point: now host-to-guest commands are done via config, and 
guest-to-host commands are done via the free page vq)

Case: Host starts to request the reporting with cmd_id=1. Some time 
later, Host writes "stop" to config, meantime guest happens to finish 
the reporting and plan to actively send a "stop" command from the 
free_page_vq().
           Essentially, this is like a sync between two threads - if we 
view the config interrupt handler as one thread, another is the free 
page reporting worker thread.

         - what the config handler does is simply:
               1.1:  WRITE_ONCE(vb->reporting_stop, true);

         - what the reporting thread will do is
               2.1:  WRITE_ONCE(vb->reporting_stop, true);
               2.2:  send_stop_to_host_via_vq();

 From the guest point of view, no matter 1.1 is executed first or 2.1 
first, it doesn't make a difference to the end result - 
vb->reporting_stop is set.

 From the host point of view, it knows that cmd_id=1 has truly stopped 
the reporting when it receives a "stop" sign via the vq.


> How about adding a buffer with "stop" in the VQ instead?
> Wastes a VQ entry which you will need to reserve for this
> but is it a big deal?

The free page vq is guest-to-host direction. Using it for host-to-guest 
requests will make it bidirectional, which will result in the same issue 
described before: https://lkml.org/lkml/2017/10/11/1009 (the first response)

On the other hand, I think adding another new vq for host-to-guest 
requesting doesn't make a difference in essence, compared to using 
config (same 1.1, 2.1, 2.2 above), but will be more complicated.


>> Two new configuration registers in total:
>> - cmd_reg: the command register, combined from the previous host2guest and
>> guest2host. I think we can use the same register for host requesting and
>> guest ACKing, since the guest writing will trap to QEMU, that is, all the
>> writes to the register are performed in QEMU, and we can keep things work in
>> a correct way there.
>> - cmd_id_reg: the sequence id of the free page report command.
>>
>> -- free page report:
>>      - host requests the guest to start reporting by "cmd_reg |
>> REPORT_START";
>>      - guest ACKs to the host about receiving the start reporting request by
>> "cmd_reg | REPORT_START", host will clear the flag bit once receiving the
>> ACK.
>>      - host requests the guest to stop reporting by "cmd_reg | REPORT_STOP";
>>      - guest ACKs to the host about receiving the stop reporting request by
>> "cmd_reg | REPORT_STOP", host will clear the flag once receiving the ACK.
>>      - guest tells the host about the start of the reporting by writing "cmd
>> id" into an outbuf, which is added to the free page vq.
>>      - guest tells the host about the end of the reporting by writing "0"
>> into an outbuf, which is added to the free page vq. (we reserve "id=0" as
>> the stop sign)
>>
>> -- ballooning:
>>      - host requests the guest to start ballooning by "cmd_reg | BALLOONING";
>>      - guest ACKs to the host about receiving the request by "cmd_reg |
>> BALLOONING", host will clear the flag once receiving the ACK.
>>
>>
>> Some more explanations:
>> -- Why not let the host request the guest to start the free page reporting
>> simply by writing a new cmd id to the cmd_id_reg?
>> The configuration interrupt is shared among all the features - ballooning,
>> free page reporting, and future feature extensions which need host-to-guest
>> requests. Some features may need to add other feature specific configuration
>> registers, like free page reporting need the cmd_id_reg, which is not used
>> by ballooning. The rule here is that the feature specific registers are read
>> only when that feature is requested via the cmd_reg. For example, the
>> cmd_id_reg is read only when "cmd_reg | REPORT_START" is true. Otherwise,
>> when the driver receives a configuration interrupt, it has to read both
>> cmd_reg and cmd_id registers to know what are requested by the host - think
>> about the case that ballooning requests are sent frequently while free page
>> reporting isn't requested, the guest has to read the cmd_id register every
>> time a ballooning request is sent by the host, which is not necessary. If
>> future new features follow this style, there will be more unnecessary
>> VMexits to read the unused feature specific registers.
>> So I think it is good to have a central control of the feature request via
>> only one cmd register - reading that one is enough to know what is requested
>> by the host.
>>
> Right now you are increasing the cost of balloon request 3x though.

Not that much, I think, just a cmd register read and ACK, and this 
should be negligible compared to the ballooning time.
(I don't see a difference in the performance testing either).

Best,
Wei


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-15  3:47           ` Wei Wang
@ 2017-11-15 13:26             ` Michael S. Tsirkin
  2017-11-16 11:59               ` Wei Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2017-11-15 13:26 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, Nitesh Narayan Lal,
	Rik van Riel

On Wed, Nov 15, 2017 at 11:47:58AM +0800, Wei Wang wrote:
> On 11/15/2017 05:21 AM, Michael S. Tsirkin wrote:
> > On Tue, Nov 14, 2017 at 08:02:03PM +0800, Wei Wang wrote:
> > > On 11/14/2017 01:32 AM, Michael S. Tsirkin wrote:
> > > > > - guest2host_cmd: written by the guest to ACK to the host about the
> > > > > commands that have been received. The host will clear the corresponding
> > > > > bits on the host2guest_cmd register. The guest also uses this register
> > > > > to send commands to the host (e.g. when finish free page reporting).
> > > > I am not sure what is the role of guest2host_cmd. Reporting of
> > > > the correct cmd id seems sufficient indication that guest
> > > > received the start command. Not getting any more seems sufficient
> > > > to detect stop.
> > > > 
> > > I think the issue is when the host is waiting for the guest to report pages,
> > > it does not know whether the guest is going to report more or the report is
> > > done already. That's why we need a way to let the guest tell the host "the
> > > report is done, don't wait for more", then the host continues to the next
> > > step - sending the non-free pages to the destination. The following method
> > > is a conclusion of other comments, with some new thought. Please have a
> > > check if it is good.
> > config won't work well for this IMHO.
> > Writes to config register are hard to synchronize with the VQ.
> > For example, guest sends free pages, host says stop, meanwhile
> > guest sends stop for 1st set of pages.
> 
> I still don't see an issue with this. Please see below:
> (before jumping into the discussion, just make sure I've well explained this
> point: now host-to-guest commands are done via config, and guest-to-host
> commands are done via the free page vq)

This is fine by me actually. But right now you have guest to host
not going through vq, going through command register instead -
this is how sending stop to host seems to happen.
If you make it go through vq then I think all will be well.

> 
> Case: Host starts to request the reporting with cmd_id=1. Some time later,
> Host writes "stop" to config, meantime guest happens to finish the reporting
> and plan to actively send a "stop" command from the free_page_vq().
>           Essentially, this is like a sync between two threads - if we view
> the config interrupt handler as one thread, another is the free page
> reporting worker thread.
> 
>         - what the config handler does is simply:
>               1.1:  WRITE_ONCE(vb->reporting_stop, true);
> 
>         - what the reporting thread will do is
>               2.1:  WRITE_ONCE(vb->reporting_stop, true);
>               2.2:  send_stop_to_host_via_vq();
> 
> From the guest point of view, no matter 1.1 is executed first or 2.1 first,
> it doesn't make a difference to the end result - vb->reporting_stop is set.
> 
> From the host point of view, it knows that cmd_id=1 has truly stopped the
> reporting when it receives a "stop" sign via the vq.
> 
> 
> > How about adding a buffer with "stop" in the VQ instead?
> > Wastes a VQ entry which you will need to reserve for this
> > but is it a big deal?
> 
> The free page vq is guest-to-host direction.

Yes, for guest to host stop sign.

> Using it for host-to-guest
> requests will make it bidirectional, which will result in the same issue
> described before: https://lkml.org/lkml/2017/10/11/1009 (the first response)
> 
> On the other hand, I think adding another new vq for host-to-guest
> requesting doesn't make a difference in essence, compared to using config
> (same 1.1, 2.1, 2.2 above), but will be more complicated.

I agree with this. Host to guest can just increment the "free command id"
register.

> 
> > > Two new configuration registers in total:
> > > - cmd_reg: the command register, combined from the previous host2guest and
> > > guest2host. I think we can use the same register for host requesting and
> > > guest ACKing, since the guest writing will trap to QEMU, that is, all the
> > > writes to the register are performed in QEMU, and we can keep things work in
> > > a correct way there.
> > > - cmd_id_reg: the sequence id of the free page report command.
> > > 
> > > -- free page report:
> > >      - host requests the guest to start reporting by "cmd_reg |
> > > REPORT_START";
> > >      - guest ACKs to the host about receiving the start reporting request by
> > > "cmd_reg | REPORT_START", host will clear the flag bit once receiving the
> > > ACK.
> > >      - host requests the guest to stop reporting by "cmd_reg | REPORT_STOP";
> > >      - guest ACKs to the host about receiving the stop reporting request by
> > > "cmd_reg | REPORT_STOP", host will clear the flag once receiving the ACK.
> > >      - guest tells the host about the start of the reporting by writing "cmd
> > > id" into an outbuf, which is added to the free page vq.
> > >      - guest tells the host about the end of the reporting by writing "0"
> > > into an outbuf, which is added to the free page vq. (we reserve "id=0" as
> > > the stop sign)
> > > 
> > > -- ballooning:
> > >      - host requests the guest to start ballooning by "cmd_reg | BALLOONING";
> > >      - guest ACKs to the host about receiving the request by "cmd_reg |
> > > BALLOONING", host will clear the flag once receiving the ACK.
> > > 
> > > 
> > > Some more explanations:
> > > -- Why not let the host request the guest to start the free page reporting
> > > simply by writing a new cmd id to the cmd_id_reg?
> > > The configuration interrupt is shared among all the features - ballooning,
> > > free page reporting, and future feature extensions which need host-to-guest
> > > requests. Some features may need to add other feature specific configuration
> > > registers, like free page reporting need the cmd_id_reg, which is not used
> > > by ballooning. The rule here is that the feature specific registers are read
> > > only when that feature is requested via the cmd_reg. For example, the
> > > cmd_id_reg is read only when "cmd_reg | REPORT_START" is true. Otherwise,
> > > when the driver receives a configuration interrupt, it has to read both
> > > cmd_reg and cmd_id registers to know what are requested by the host - think
> > > about the case that ballooning requests are sent frequently while free page
> > > reporting isn't requested, the guest has to read the cmd_id register every
> > > time a ballooning request is sent by the host, which is not necessary. If
> > > future new features follow this style, there will be more unnecessary
> > > VMexits to read the unused feature specific registers.
> > > So I think it is good to have a central control of the feature request via
> > > only one cmd register - reading that one is enough to know what is requested
> > > by the host.
> > > 
> > Right now you are increasing the cost of balloon request 3x though.
> 
> Not that much, I think, just a cmd register read and ACK, and this should be
> negligible compared to the ballooning time.
> (I don't see a difference in the performance testing either).
> 
> Best,
> Wei


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-03  8:13 ` [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
  2017-11-13 10:34   ` Wei Wang
@ 2017-11-15 20:32   ` Michael S. Tsirkin
  2017-11-16 13:27     ` [virtio-dev] " Wei Wang
  1 sibling, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2017-11-15 20:32 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
> support of reporting hints of guest free pages to the host via
> virtio-balloon. The host requests the guest to report the free pages by
> sending commands via the virtio-balloon configuration registers.
> 
> When the guest starts to report, the first element added to the free page
> vq is a sequence id of the start reporting command. The id is given by
> the host, and it indicates whether the following free pages correspond
> to the command. For example, the host may stop the report and start again
> with a new command id. The obsolete pages for the previous start command
> can be detected by the id mismatch on the host. The id is added to the
> vq using an output buffer, and the free pages are added to the vq using
> an input buffer.
> 
> Here are some explanations about the added configuration registers:
> - host2guest_cmd: a register used by the host to send commands to the
> guest.
> - guest2host_cmd: written by the guest to ACK to the host about the
> commands that have been received. The host will clear the corresponding
> bits on the host2guest_cmd register. The guest also uses this register
> to send commands to the host (e.g. when free page reporting finishes).
> - free_page_cmd_id: the sequence id of the free page report command
> given by the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  drivers/virtio/virtio_balloon.c     | 234 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |  11 ++
>  2 files changed, 223 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b31fc25..4087f04 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +
> +	/* Balloon's own wq for cpu-intensive work items */
> +	struct workqueue_struct *balloon_wq;
> +	/* The free page reporting work item submitted to the balloon wq */
> +	struct work_struct report_free_page_work;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -65,6 +70,10 @@ struct virtio_balloon {
>  	spinlock_t stop_update_lock;
>  	bool stop_update;
>  
> +	/* Stop reporting free pages */
> +	bool report_free_page_stop;
> +	uint32_t free_page_cmd_id;
> +
>  	/* Waiting for host to ack the pages we released. */
>  	wait_queue_head_t acked;
>  
> @@ -191,6 +200,30 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  		kick_and_wait(vq, vb->acked);
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	int err = 0;
> +	unsigned int len;
> +
> +	/* Detach all the used buffers from the vq */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +
> +	/*
> +	 * Since this is an optimization feature, losing a couple of free
> +	 * pages to report isn't important. We simply return without adding
> +	 * the page if the vq is full.
> +	 */
> +	if (vq->num_free) {
> +		err = add_one_sg(vq, addr, size);
> +		BUG_ON(err);
> +	}
> +
> +	/* Batch till the vq is full */
> +	if (!vq->num_free)
> +		virtqueue_kick(vq);
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -495,9 +528,8 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	virtqueue_kick(vq);
>  }
>  
> -static void virtballoon_changed(struct virtio_device *vdev)
> +static void virtballoon_cmd_balloon_memory(struct virtio_balloon *vb)
>  {
> -	struct virtio_balloon *vb = vdev->priv;
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&vb->stop_update_lock, flags);
> @@ -506,6 +538,50 @@ static void virtballoon_changed(struct virtio_device *vdev)
>  	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
>  }
>  
> +static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb)
> +{
> +	unsigned long flags;
> +
> +	vb->report_free_page_stop = false;
> +	spin_lock_irqsave(&vb->stop_update_lock, flags);
> +	if (!vb->stop_update)
> +		queue_work(vb->balloon_wq, &vb->report_free_page_work);
> +	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> +}
> +
> +static void virtballoon_changed(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = vdev->priv;
> +	u32 host2guest_cmd, guest2host_cmd = 0;
> +
> +	if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		virtballoon_cmd_balloon_memory(vb);
> +		return;
> +	}
> +
> +	virtio_cread(vb->vdev, struct virtio_balloon_config, host2guest_cmd,
> +		     &host2guest_cmd);
> +
> +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_BALLOON_MEMORY) {
> +		virtballoon_cmd_balloon_memory(vb);
> +		guest2host_cmd |= VIRTIO_BALLOON_CMD_BALLOON_MEMORY;
> +	}
> +
> +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START) {
> +		virtballoon_cmd_report_free_page_start(vb);
> +		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START;
> +	}
> +
> +	if (host2guest_cmd & VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP) {
> +		vb->report_free_page_stop = true;
> +		guest2host_cmd |= VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
> +	}
> +
> +	/* Ack to the host about the commands that have been received */
> +	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
> +		      &guest2host_cmd);
> +}
> +
>  static inline s64 towards_target(struct virtio_balloon *vb)
>  {
>  	s64 target;
> @@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	if (vb->report_free_page_stop)
> +		return false;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
>  
> +	return true;
> +}
> +
> +static void report_free_page_end(struct virtio_balloon *vb)
> +{
> +	u32 cmd = VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP;
>  	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * The host may have already requested to stop the reporting before we
> +	 * finish, so no need to notify the host in this case.
>  	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (vb->report_free_page_stop)
> +		return;
> +	vb->report_free_page_stop = true;
> +
> +	virtio_cwrite(vb->vdev, struct virtio_balloon_config, guest2host_cmd,
> +		      &cmd);
> +}
> +
> +static void report_free_page_cmd_id(struct virtio_balloon *vb)
> +{
> +	struct scatterlist sg;
> +	int err;
> +
> +	virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_cmd_id,
> +		     &vb->free_page_cmd_id);
> +	sg_init_one(&sg, &vb->free_page_cmd_id, sizeof(uint32_t));
> +	err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1,
> +				   &vb->free_page_cmd_id, GFP_KERNEL);
> +	BUG_ON(err);
> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	report_free_page_cmd_id(vb);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	/*
> +	 * The last few free page blocks that were added may not reach the
> +	 * batch size, but need a kick to notify the device to handle them.
> +	 */
> +	virtqueue_kick(vb->free_page_vq);
> +	report_free_page_end(vb);
> +}
> +

I think there's an issue here: if pages are poisoned and the hypervisor
subsequently drops them, testing them after allocation will trigger a
false positive.

The specific configuration:

PAGE_POISONING on
PAGE_POISONING_NO_SANITY off
PAGE_POISONING_ZERO off


Solutions:
1. disable the feature in that configuration
	(suggested as an initial step)
2. pass the poison value to the host so it can validate page content
   before dropping it
3. pass the poison value to the host so it can initialize allocated pages with that value

In fact, one nice side effect would be that unmap becomes safe even
though the free list is no longer locked.

It would be interesting to see whether this last option has any value
performance-wise.
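A rough userspace model of solution 2 (all names and the poison value here are illustrative; in practice the guest would communicate the poison value to the host):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define POISON_BYTE 0xaa	/* example pattern, told to the host by the guest */

/* Before dropping a hinted free page, the host checks that every byte
 * still holds the guest's poison value; if any byte differs, the page
 * was reused after the hint and must be migrated instead of dropped. */
int page_safe_to_drop(const uint8_t *page, uint8_t poison)
{
	size_t i;

	for (i = 0; i < PAGE_SIZE; i++)
		if (page[i] != poison)
			return 0;
	return 1;
}
```

This makes a stale hint harmless: a page the guest reused after hinting fails the check and is transferred normally.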


> +static int init_vqs(struct virtio_balloon *vb)
> +{
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = NULL;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		vb->free_page_vq = vqs[i];
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -761,6 +942,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->balloon_wq = alloc_workqueue("balloon-wq",
> +					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +		vb->report_free_page_stop = true;
> +	}
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -825,6 +1013,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -878,6 +1067,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..b758484 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,15 +35,26 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* VQ to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>  
> +#define	VIRTIO_BALLOON_CMD_BALLOON_MEMORY		(1 << 0)
> +#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_START	(1 << 1)
> +#define	VIRTIO_BALLOON_CMD_REPORT_FREE_PAGE_STOP	(1 << 2)
> +
>  struct virtio_balloon_config {
>  	/* Number of pages host wants Guest to give up. */
>  	__u32 num_pages;
>  	/* Number of pages we've actually got in balloon. */
>  	__u32 actual;
> +	/* Host-to-guest command, readonly by guest */
> +	__u32 host2guest_cmd;
> +	/* Sequence id of the free_page report command, readonly by guest */
> +	__u32 free_page_cmd_id;
> +	/* Guest-to-host command */
> +	__u32 guest2host_cmd;
>  };
>  
>  #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-15 13:26             ` Michael S. Tsirkin
@ 2017-11-16 11:59               ` Wei Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-16 11:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, Nitesh Narayan Lal,
	Rik van Riel

On 11/15/2017 09:26 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 15, 2017 at 11:47:58AM +0800, Wei Wang wrote:
>> On 11/15/2017 05:21 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 14, 2017 at 08:02:03PM +0800, Wei Wang wrote:
>>>> On 11/14/2017 01:32 AM, Michael S. Tsirkin wrote:
>>>>>> - guest2host_cmd: written by the guest to ACK to the host about the
>>>>>> commands that have been received. The host will clear the corresponding
>>>>>> bits on the host2guest_cmd register. The guest also uses this register
>>>>>> to send commands to the host (e.g. when finish free page reporting).
>>>>> I am not sure what is the role of guest2host_cmd. Reporting of
>>>>> the correct cmd id seems sufficient indication that guest
>>>>> received the start command. Not getting any more seems sufficient
>>>>> to detect stop.
>>>>>
>>>> I think the issue is when the host is waiting for the guest to report pages,
>>>> it does not know whether the guest is going to report more or the report is
>>>> done already. That's why we need a way to let the guest tell the host "the
>>>> report is done, don't wait for more", then the host continues to the next
>>>> step - sending the non-free pages to the destination. The following method
>>>> is a conclusion of other comments, with some new thought. Please have a
>>>> check if it is good.
>>> config won't work well for this IMHO.
>>> Writes to config register are hard to synchronize with the VQ.
>>> For example, guest sends free pages, host says stop, meanwhile
>>> guest sends stop for 1st set of pages.
>> I still don't see an issue with this. Please see below:
>> (before jumping into the discussion, just make sure I've well explained this
>> point: now host-to-guest commands are done via config, and guest-to-host
>> commands are done via the free page vq)
> This is fine by me actually. But right now you have guest-to-host traffic
> not going through the vq but through the command register instead -
> this is how sending stop to the host seems to happen.
> If you make it go through the vq then I think all will be well.
>
>> Case: the host starts to request the reporting with cmd_id=1. Some time later,
>> the host writes "stop" to config; meanwhile, the guest happens to finish the
>> reporting and plans to actively send a "stop" command via the free_page_vq.
>> Essentially, this is like a sync between two threads: the config interrupt
>> handler is one thread, and the free page reporting worker is the other.
>>
>>          - what the config handler does is simply:
>>                1.1:  WRITE_ONCE(vb->reporting_stop, true);
>>
>>          - what the reporting thread will do is
>>                2.1:  WRITE_ONCE(vb->reporting_stop, true);
>>                2.2:  send_stop_to_host_via_vq();
>>
>> From the guest's point of view, no matter whether 1.1 or 2.1 executes first,
>> it doesn't make a difference to the end result: vb->reporting_stop is set.
>>
>> From the host's point of view, it knows that cmd_id=1 has truly stopped the
>> reporting when it receives a "stop" sign via the vq.
>>
>>
>>> How about adding a buffer with "stop" in the VQ instead?
>>> Wastes a VQ entry which you will need to reserve for this
>>> but is it a big deal?
>> The free page vq is guest-to-host direction.
> Yes, for guest to host stop sign.
>
>> Using it for host-to-guest
>> requests will make it bidirectional, which will result in the same issue
>> described before: https://lkml.org/lkml/2017/10/11/1009 (the first response)
>>
>> On the other hand, I think adding another new vq for host-to-guest
>> requesting doesn't make a difference in essence, compared to using config
>> (same 1.1, 2.1, 2.2 above), but will be more complicated.
> I agree with this. Host to guest can just increment the "free command id"
> register.


OK, thanks for the suggestions. I think one more issue left here:

Previously, when the guest received a config interrupt, it blindly added
the balloon work item to the workqueue in virtballoon_changed(), because
only ballooning used the config.
Now, free page reporting is requested via config, too.

We have the following two options:

Option 1: add "diff = towards_target()" to virtballoon_changed(), and if 
diff == 0, do not add the balloon work item to the wq.

Option 2: add "cmd" for the host-to-guest request, and add the item when 
"cmd & CMD_BALLOON" is true.

I'm inclined to take option 1 now. Which one would you prefer?
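Option 1 could be sketched in userspace roughly like this (the struct and helper names are simplified stand-ins; the flag models queue_work() being called):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct virtio_balloon; 'work_queued' models
 * queue_work() being called. */
struct balloon_model {
	int64_t num_pages;	/* balloon size requested by the host */
	int64_t actual;		/* pages currently in the balloon */
	int work_queued;
};

int64_t towards_target(const struct balloon_model *vb)
{
	return vb->num_pages - vb->actual;
}

/* Option 1: on a config interrupt, queue the balloon work item only
 * when there is actually something to inflate or deflate. */
void config_changed(struct balloon_model *vb)
{
	if (towards_target(vb) != 0)
		vb->work_queued = 1;
}
```

A free-page-report-only config change leaves num_pages == actual, so no balloon work is queued.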

Best,
Wei



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-15 20:32   ` Michael S. Tsirkin
@ 2017-11-16 13:27     ` Wei Wang
  2017-11-17 11:35       ` Wei Wang
  2017-11-17 13:18       ` Michael S. Tsirkin
  0 siblings, 2 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-16 13:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
>> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
>> support of reporting hints of guest free pages to the host via
>> virtio-balloon. The host requests the guest to report the free pages by
>> sending commands via the virtio-balloon configuration registers.
>>
>> When the guest starts to report, the first element added to the free page
>> vq is a sequence id of the start reporting command. The id is given by
>> the host, and it indicates whether the following free pages correspond
>> to the command. For example, the host may stop the report and start again
>> with a new command id. The obsolete pages for the previous start command
>> can be detected by the id mismatch on the host. The id is added to the
>> vq using an output buffer, and the free pages are added to the vq using
>> an input buffer.
>>
>> Here are some explanations about the added configuration registers:
>> - host2guest_cmd: a register used by the host to send commands to the
>> guest.
>> - guest2host_cmd: written by the guest to ACK to the host about the
>> commands that have been received. The host will clear the corresponding
>> bits on the host2guest_cmd register. The guest also uses this register
>> to send commands to the host (e.g. when free page reporting finishes).
>> - free_page_cmd_id: the sequence id of the free page report command
>> given by the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> ---
>>
>> +
>> +static void report_free_page(struct work_struct *work)
>> +{
>> +	struct virtio_balloon *vb;
>> +
>> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
>> +	report_free_page_cmd_id(vb);
>> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>> +	/*
>> +	 * The last few free page blocks that were added may not reach the
>> +	 * batch size, but need a kick to notify the device to handle them.
>> +	 */
>> +	virtqueue_kick(vb->free_page_vq);
>> +	report_free_page_end(vb);
>> +}
>> +
> I think there's an issue here: if pages are poisoned and hypervisor
> subsequently drops them, testing them after allocation will
> trigger a false positive.
>
> The specific configuration:
>
> PAGE_POISONING on
> PAGE_POISONING_NO_SANITY off
> PAGE_POISONING_ZERO off
>
>
> Solutions:
> 1. disable the feature in that configuration
> 	suggested as an initial step

Thanks for the finding.
Similar to this option: I'm thinking, could we make walk_free_mem_block() 
simply return if that option is on? That is, at the beginning of the function:
     if (!page_poisoning_enabled())
             return;

I think in most usages, people would not choose to use the poisoning 
option due to the added overhead.


Probably we could make this a separate fix patch following patch 5, and 
explain the above reasons in the commit log.

> 2. pass poison value to host so it can validate page content
>     before it drops it
> 3. pass poison value to host so it can init allocated pages with that value
>
> In fact one nice side effect would be that unmap
> becomes safe even though free list is not locked anymore.

I haven't got this point yet: how would it bring a performance benefit?

> It would be interesting to see whether this last has
> any value performance-wise.
>

Best,
Wei


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-16 13:27     ` [virtio-dev] " Wei Wang
@ 2017-11-17 11:35       ` Wei Wang
  2017-11-17 11:48         ` Wei Wang
  2017-11-17 12:44         ` Michael S. Tsirkin
  2017-11-17 13:18       ` Michael S. Tsirkin
  1 sibling, 2 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-17 11:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	penguin-kernel, linux-kernel, willy, virtualization, linux-mm,
	yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm, mhocko,
	mgorman, liliang.opensource

On 11/16/2017 09:27 PM, Wei Wang wrote:
> On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
>> On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
>>> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
>>> support of reporting hints of guest free pages to the host via
>>> virtio-balloon. The host requests the guest to report the free pages by
>>> sending commands via the virtio-balloon configuration registers.
>>>
>>> When the guest starts to report, the first element added to the free 
>>> page
>>> vq is a sequence id of the start reporting command. The id is given by
>>> the host, and it indicates whether the following free pages correspond
>>> to the command. For example, the host may stop the report and start 
>>> again
>>> with a new command id. The obsolete pages for the previous start 
>>> command
>>> can be detected by the id mismatch on the host. The id is added 
>>> to the
>>> vq using an output buffer, and the free pages are added to the vq using
>>> an input buffer.
>>>
>>> Here are some explanations about the added configuration registers:
>>> - host2guest_cmd: a register used by the host to send commands to the
>>> guest.
>>> - guest2host_cmd: written by the guest to ACK to the host about the
>>> commands that have been received. The host will clear the corresponding
>>> bits on the host2guest_cmd register. The guest also uses this register
>>> to send commands to the host (e.g. when free page reporting finishes).
>>> - free_page_cmd_id: the sequence id of the free page report command
>>> given by the host.
>>>
>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> ---
>>>
>>> +
>>> +static void report_free_page(struct work_struct *work)
>>> +{
>>> +    struct virtio_balloon *vb;
>>> +
>>> +    vb = container_of(work, struct virtio_balloon, 
>>> report_free_page_work);
>>> +    report_free_page_cmd_id(vb);
>>> +    walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>>> +    /*
>>> +     * The last few free page blocks that were added may not reach the
>>> +     * batch size, but need a kick to notify the device to handle 
>>> them.
>>> +     */
>>> +    virtqueue_kick(vb->free_page_vq);
>>> +    report_free_page_end(vb);
>>> +}
>>> +
>> I think there's an issue here: if pages are poisoned and hypervisor
>> subsequently drops them, testing them after allocation will
>> trigger a false positive.
>>
>> The specific configuration:
>>
>> PAGE_POISONING on
>> PAGE_POISONING_NO_SANITY off
>> PAGE_POISONING_ZERO off
>>
>>
>> Solutions:
>> 1. disable the feature in that configuration
>>     suggested as an initial step
>
> Thanks for the finding.
> Similar to this option: I'm thinking could we make 
> walk_free_mem_block() simply return if that option is on?
> That is, at the beginning of the function:
>     if (!page_poisoning_enabled())
>                 return;
>


Having thought about it more, I think it would be better to put this logic 
in virtio_balloon:

         send_free_page_cmd_id(vb, &vb->start_cmd_id);
         if (page_poisoning_enabled() &&
             !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
                 walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
         send_free_page_cmd_id(vb, &vb->stop_cmd_id);


walk_free_mem_block() should remain a generic API, and this potential 
page poisoning issue is specific to live migration, which is only one use 
case of the function, so I think it is better to handle it in that use 
case itself.

Best,
Wei




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-17 11:35       ` Wei Wang
@ 2017-11-17 11:48         ` Wei Wang
  2017-11-17 12:44         ` Michael S. Tsirkin
  1 sibling, 0 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-17 11:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	penguin-kernel, linux-kernel, willy, virtualization, linux-mm,
	yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm, mhocko,
	mgorman, liliang.opensource

On 11/17/2017 07:35 PM, Wei Wang wrote:
> On 11/16/2017 09:27 PM, Wei Wang wrote:
>> On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
>>> On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
>>>> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
>>>> support of reporting hints of guest free pages to the host via
>>>> virtio-balloon. The host requests the guest to report the free 
>>>> pages by
>>>> sending commands via the virtio-balloon configuration registers.
>>>>
>>>> When the guest starts to report, the first element added to the 
>>>> free page
>>>> vq is a sequence id of the start reporting command. The id is given by
>>>> the host, and it indicates whether the following free pages correspond
>>>> to the command. For example, the host may stop the report and start 
>>>> again
>>>> with a new command id. The obsolete pages for the previous start 
>>>> command
>>>> can be detected by the id mismatch on the host. The id is added 
>>>> to the
>>>> vq using an output buffer, and the free pages are added to the vq 
>>>> using
>>>> an input buffer.
>>>>
>>>> Here are some explanations about the added configuration registers:
>>>> - host2guest_cmd: a register used by the host to send commands to the
>>>> guest.
>>>> - guest2host_cmd: written by the guest to ACK to the host about the
>>>> commands that have been received. The host will clear the 
>>>> corresponding
>>>> bits on the host2guest_cmd register. The guest also uses this register
>>>> to send commands to the host (e.g. when free page reporting finishes).
>>>> - free_page_cmd_id: the sequence id of the free page report command
>>>> given by the host.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>> ---
>>>>
>>>> +
>>>> +static void report_free_page(struct work_struct *work)
>>>> +{
>>>> +    struct virtio_balloon *vb;
>>>> +
>>>> +    vb = container_of(work, struct virtio_balloon, 
>>>> report_free_page_work);
>>>> +    report_free_page_cmd_id(vb);
>>>> +    walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>>>> +    /*
>>>> +     * The last few free page blocks that were added may not reach 
>>>> the
>>>> +     * batch size, but need a kick to notify the device to handle 
>>>> them.
>>>> +     */
>>>> +    virtqueue_kick(vb->free_page_vq);
>>>> +    report_free_page_end(vb);
>>>> +}
>>>> +
>>> I think there's an issue here: if pages are poisoned and hypervisor
>>> subsequently drops them, testing them after allocation will
>>> trigger a false positive.
>>>
>>> The specific configuration:
>>>
>>> PAGE_POISONING on
>>> PAGE_POISONING_NO_SANITY off
>>> PAGE_POISONING_ZERO off
>>>
>>>
>>> Solutions:
>>> 1. disable the feature in that configuration
>>>     suggested as an initial step
>>
>> Thanks for the finding.
>> Similar to this option: I'm thinking could we make 
>> walk_free_mem_block() simply return if that option is on?
>> That is, at the beginning of the function:
>>     if (!page_poisoning_enabled())
>>                 return;
>>
>
>
> Thought about it more, I think it would be better to put this logic to 
> virtio_balloon:
>
>         send_free_page_cmd_id(vb, &vb->start_cmd_id);
>         if (page_poisoning_enabled() &&
>             !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
>                 walk_free_mem_block(vb, 0, 
> &virtio_balloon_send_free_pages);

The logic should be inverted:
     if (!(page_poisoning_enabled() &&
             !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY)))
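For clarity, here is the corrected guard modeled as stand-alone C (the two predicates are stubbed out as plain booleans; in the driver they are page_poisoning_enabled() and the IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) check):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel predicates discussed in this thread. */
static bool poisoning_enabled;   /* page_poisoning_enabled() */
static bool no_sanity_checks;    /* IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) */

/*
 * Free pages may be reported only when a page the hypervisor drops
 * cannot trip the poison sanity check: either poisoning is off, or
 * the sanity check itself is compiled out.
 */
static bool may_report_free_pages(void)
{
	return !(poisoning_enabled && !no_sanity_checks);
}
```

So with PAGE_POISONING on and PAGE_POISONING_NO_SANITY off - the problematic configuration identified above - the walk is skipped.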

Best,
Wei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-17 11:35       ` Wei Wang
  2017-11-17 11:48         ` Wei Wang
@ 2017-11-17 12:44         ` Michael S. Tsirkin
  2017-11-18  5:22           ` Wang, Wei W
  1 sibling, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2017-11-17 12:44 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	penguin-kernel, linux-kernel, willy, virtualization, linux-mm,
	yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm, mhocko,
	mgorman, liliang.opensource

On Fri, Nov 17, 2017 at 07:35:03PM +0800, Wei Wang wrote:
> On 11/16/2017 09:27 PM, Wei Wang wrote:
> > On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> > > On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
> > > > support of reporting hints of guest free pages to the host via
> > > > virtio-balloon. The host requests the guest to report the free pages by
> > > > sending commands via the virtio-balloon configuration registers.
> > > > 
> > > > When the guest starts to report, the first element added to the
> > > > free page
> > > > vq is a sequence id of the start reporting command. The id is given by
> > > > the host, and it indicates whether the following free pages correspond
> > > > to the command. For example, the host may stop the report and
> > > > start again
> > > > with a new command id. The obsolete pages for the previous start
> > > > command
> > > > can be detected by the id mismatching on the host. The id is
> > > > added to the
> > > > vq using an output buffer, and the free pages are added to the vq using
> > > > input buffer.
> > > > 
> > > > Here are some explanations about the added configuration registers:
> > > > - host2guest_cmd: a register used by the host to send commands to the
> > > > guest.
> > > > - guest2host_cmd: written by the guest to ACK to the host about the
> > > > commands that have been received. The host will clear the corresponding
> > > > bits on the host2guest_cmd register. The guest also uses this register
> > > > to send commands to the host (e.g. when finishing free page reporting).
> > > > - free_page_cmd_id: the sequence id of the free page report command
> > > > given by the host.
> > > > 
> > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > Cc: Michal Hocko <mhocko@kernel.org>
> > > > ---
> > > > 
> > > > +
> > > > +static void report_free_page(struct work_struct *work)
> > > > +{
> > > > +    struct virtio_balloon *vb;
> > > > +
> > > > +    vb = container_of(work, struct virtio_balloon,
> > > > report_free_page_work);
> > > > +    report_free_page_cmd_id(vb);
> > > > +    walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > > > +    /*
> > > > +     * The last few free page blocks that were added may not reach the
> > > > +     * batch size, but need a kick to notify the device to
> > > > handle them.
> > > > +     */
> > > > +    virtqueue_kick(vb->free_page_vq);
> > > > +    report_free_page_end(vb);
> > > > +}
> > > > +
> > > I think there's an issue here: if pages are poisoned and hypervisor
> > > subsequently drops them, testing them after allocation will
> > > trigger a false positive.
> > > 
> > > The specific configuration:
> > > 
> > > PAGE_POISONING on
> > > PAGE_POISONING_NO_SANITY off
> > > PAGE_POISONING_ZERO off
> > > 
> > > 
> > > Solutions:
> > > 1. disable the feature in that configuration
> > >     suggested as an initial step
> > 
> > Thanks for the finding.
> > Similar to this option: I'm thinking could we make walk_free_mem_block()
> > simply return if that option is on?
> > That is, at the beginning of the function:
> >     if (!page_poisoning_enabled())
> >                 return;
> > 
> 
> 
> Thought about it more, I think it would be better to put this logic to
> virtio_balloon:
> 
>         send_free_page_cmd_id(vb, &vb->start_cmd_id);
>         if (page_poisoning_enabled() &&
>             !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
>                 walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
>         send_free_page_cmd_id(vb, &vb->stop_cmd_id);
> 
> 
> walk_free_mem_block() should be a more generic API, and this potential page
> poisoning issue is specific to live migration which is only one use case of
> this function, so I think it is better to handle it in the special use case
> itself.
> 
> Best,
> Wei
> 

It's a quick work-around but it doesn't make me very happy.

AFAIK e.g. RHEL has a debug kernel with poisoning enabled.
If this never uses free page hinting at all, it will
be much less useful for debugging guests.

-- 
MST


* Re: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-16 13:27     ` [virtio-dev] " Wei Wang
  2017-11-17 11:35       ` Wei Wang
@ 2017-11-17 13:18       ` Michael S. Tsirkin
  1 sibling, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2017-11-17 13:18 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

On Thu, Nov 16, 2017 at 09:27:24PM +0800, Wei Wang wrote:
> On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> > On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
> > > support of reporting hints of guest free pages to the host via
> > > virtio-balloon. The host requests the guest to report the free pages by
> > > sending commands via the virtio-balloon configuration registers.
> > > 
> > > When the guest starts to report, the first element added to the free page
> > > vq is a sequence id of the start reporting command. The id is given by
> > > the host, and it indicates whether the following free pages correspond
> > > to the command. For example, the host may stop the report and start again
> > > with a new command id. The obsolete pages for the previous start command
> > > can be detected by the id mismatching on the host. The id is added to the
> > > vq using an output buffer, and the free pages are added to the vq using
> > > input buffer.
> > > 
> > > Here are some explanations about the added configuration registers:
> > > - host2guest_cmd: a register used by the host to send commands to the
> > > guest.
> > > - guest2host_cmd: written by the guest to ACK to the host about the
> > > commands that have been received. The host will clear the corresponding
> > > bits on the host2guest_cmd register. The guest also uses this register
> > > to send commands to the host (e.g. when finishing free page reporting).
> > > - free_page_cmd_id: the sequence id of the free page report command
> > > given by the host.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > ---
> > > 
> > > +
> > > +static void report_free_page(struct work_struct *work)
> > > +{
> > > +	struct virtio_balloon *vb;
> > > +
> > > +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> > > +	report_free_page_cmd_id(vb);
> > > +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > > +	/*
> > > +	 * The last few free page blocks that were added may not reach the
> > > +	 * batch size, but need a kick to notify the device to handle them.
> > > +	 */
> > > +	virtqueue_kick(vb->free_page_vq);
> > > +	report_free_page_end(vb);
> > > +}
> > > +
> > I think there's an issue here: if pages are poisoned and hypervisor
> > subsequently drops them, testing them after allocation will
> > trigger a false positive.
> > 
> > The specific configuration:
> > 
> > PAGE_POISONING on
> > PAGE_POISONING_NO_SANITY off
> > PAGE_POISONING_ZERO off
> > 
> > 
> > Solutions:
> > 1. disable the feature in that configuration
> > 	suggested as an initial step
> 
> Thanks for the finding.
> Similar to this option: I'm thinking could we make walk_free_mem_block()
> simply return if that option is on?
> That is, at the beginning of the function:
>     if (!page_poisoning_enabled())
>                 return;
> 
> I think in most usages, people would not choose to use the poisoning option
> due to the added overhead.
> 
> 
> Probably we could make it a separate fix patch of this report following
> patch 5 to explain the above reasons in the commit.
> 
> > 2. pass poison value to host so it can validate page content
> >     before it drops it
> > 3. pass poison value to host so it can init allocated pages with that value
> > 
> > In fact one nice side effect would be that unmap
> > becomes safe even though free list is not locked anymore.
> 
> I haven't got this point yet,  how would it bring performance benefit?

Upon getting a free page, the host could check that its content
matches the poison value. If it doesn't, the page has been used.
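As a sketch of what that host-side validation could look like (the helper name and the 0xaa pattern are assumptions for illustration, not part of any patch):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Before dropping a hinted page, the host could verify that every byte
 * still holds the guest's poison value; any mismatch means the page
 * was reused after it was reported, so it must not be dropped.
 */
static bool page_still_poisoned(const uint8_t *page, size_t len,
				uint8_t poison)
{
	for (size_t i = 0; i < len; i++)
		if (page[i] != poison)
			return false;
	return true;
}
```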

But let's ignore this for now.

> > It would be interesting to see whether this last has
> > any value performance-wise.
> > 
> 
> Best,
> Wei


* RE: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-17 12:44         ` Michael S. Tsirkin
@ 2017-11-18  5:22           ` Wang, Wei W
  2017-11-19 15:11             ` Michael S. Tsirkin
  0 siblings, 1 reply; 32+ messages in thread
From: Wang, Wei W @ 2017-11-18  5:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	penguin-kernel, linux-kernel, willy, virtualization, linux-mm,
	yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm, mhocko,
	mgorman, liliang.opensource

On Friday, November 17, 2017 8:45 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 17, 2017 at 07:35:03PM +0800, Wei Wang wrote:
> > On 11/16/2017 09:27 PM, Wei Wang wrote:
> > > On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature
> > > > > indicates the support of reporting hints of guest free pages to
> > > > > the host via virtio-balloon. The host requests the guest to
> > > > > report the free pages by sending commands via the virtio-balloon
> configuration registers.
> > > > >
> > > > > When the guest starts to report, the first element added to the
> > > > > free page vq is a sequence id of the start reporting command.
> > > > > The id is given by the host, and it indicates whether the
> > > > > following free pages correspond to the command. For example, the
> > > > > host may stop the report and start again with a new command id.
> > > > > The obsolete pages for the previous start command can be
> > > > > detected by the id mismatching on the host. The id is added to
> > > > > the vq using an output buffer, and the free pages are added to
> > > > > the vq using input buffer.
> > > > >
> > > > > Here are some explanations about the added configuration registers:
> > > > > - host2guest_cmd: a register used by the host to send commands
> > > > > to the guest.
> > > > > - guest2host_cmd: written by the guest to ACK to the host about
> > > > > the commands that have been received. The host will clear the
> > > > > corresponding bits on the host2guest_cmd register. The guest
> > > > > also uses this register to send commands to the host (e.g. when
> > > > > finishing free page reporting).
> > > > > - free_page_cmd_id: the sequence id of the free page report
> > > > > command given by the host.
> > > > >
> > > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > > Cc: Michal Hocko <mhocko@kernel.org>
> > > > > ---
> > > > >
> > > > > +
> > > > > +static void report_free_page(struct work_struct *work) {
> > > > > +    struct virtio_balloon *vb;
> > > > > +
> > > > > +    vb = container_of(work, struct virtio_balloon,
> > > > > report_free_page_work);
> > > > > +    report_free_page_cmd_id(vb);
> > > > > +    walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > > > > +    /*
> > > > > +     * The last few free page blocks that were added may not reach the
> > > > > +     * batch size, but need a kick to notify the device to
> > > > > handle them.
> > > > > +     */
> > > > > +    virtqueue_kick(vb->free_page_vq);
> > > > > +    report_free_page_end(vb);
> > > > > +}
> > > > > +
> > > > I think there's an issue here: if pages are poisoned and
> > > > hypervisor subsequently drops them, testing them after allocation
> > > > will trigger a false positive.
> > > >
> > > > The specific configuration:
> > > >
> > > > PAGE_POISONING on
> > > > PAGE_POISONING_NO_SANITY off
> > > > PAGE_POISONING_ZERO off
> > > >
> > > >
> > > > Solutions:
> > > > 1. disable the feature in that configuration
> > > >     suggested as an initial step
> > >
> > > Thanks for the finding.
> > > Similar to this option: I'm thinking could we make
> > > walk_free_mem_block() simply return if that option is on?
> > > That is, at the beginning of the function:
> > >     if (!page_poisoning_enabled())
> > >                 return;
> > >
> >
> >
> > Thought about it more, I think it would be better to put this logic to
> > virtio_balloon:
> >
> >         send_free_page_cmd_id(vb, &vb->start_cmd_id);
> >         if (page_poisoning_enabled() &&
> >             !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
> >                 walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> >         send_free_page_cmd_id(vb, &vb->stop_cmd_id);
> >
> >
> > walk_free_mem_block() should be a more generic API, and this potential
> > page poisoning issue is specific to live migration which is only one
> > use case of this function, so I think it is better to handle it in the
> > special use case itself.
> >
> > Best,
> > Wei
> >
> 
> It's a quick work-around but it doesn't make me very happy.
> 
> AFAIK e.g. RHEL has a debug kernel with poisoning enabled.
> If this never uses free page hinting at all, it will be much less useful for
> debugging guests.
> 

I understand your concern. I think people who use debugging guests don't
regard performance as the first priority, and most vendors usually
wouldn't use debugging guests for their products.

How about taking this as the initial solution? We can explore more solutions after this series is done.

Best,
Wei



* Re: [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-18  5:22           ` Wang, Wei W
@ 2017-11-19 15:11             ` Michael S. Tsirkin
  0 siblings, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2017-11-19 15:11 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	penguin-kernel, linux-kernel, willy, virtualization, linux-mm,
	yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm, mhocko,
	mgorman, liliang.opensource

On Sat, Nov 18, 2017 at 05:22:28AM +0000, Wang, Wei W wrote:
> On Friday, November 17, 2017 8:45 PM, Michael S. Tsirkin wrote:
> > On Fri, Nov 17, 2017 at 07:35:03PM +0800, Wei Wang wrote:
> > > On 11/16/2017 09:27 PM, Wei Wang wrote:
> > > > On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> > > > > On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> > > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature
> > > > > > indicates the support of reporting hints of guest free pages to
> > > > > > the host via virtio-balloon. The host requests the guest to
> > > > > > report the free pages by sending commands via the virtio-balloon
> > configuration registers.
> > > > > >
> > > > > > When the guest starts to report, the first element added to the
> > > > > > free page vq is a sequence id of the start reporting command.
> > > > > > The id is given by the host, and it indicates whether the
> > > > > > following free pages correspond to the command. For example, the
> > > > > > host may stop the report and start again with a new command id.
> > > > > > The obsolete pages for the previous start command can be
> > > > > > detected by the id mismatching on the host. The id is added to
> > > > > > the vq using an output buffer, and the free pages are added to
> > > > > > the vq using input buffer.
> > > > > >
> > > > > > Here are some explanations about the added configuration registers:
> > > > > > - host2guest_cmd: a register used by the host to send commands
> > > > > > to the guest.
> > > > > > - guest2host_cmd: written by the guest to ACK to the host about
> > > > > > the commands that have been received. The host will clear the
> > > > > > corresponding bits on the host2guest_cmd register. The guest
> > > > > > also uses this register to send commands to the host (e.g. when
> > > > > > finishing free page reporting).
> > > > > > - free_page_cmd_id: the sequence id of the free page report
> > > > > > command given by the host.
> > > > > >
> > > > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Cc: Michal Hocko <mhocko@kernel.org>
> > > > > > ---
> > > > > >
> > > > > > +
> > > > > > +static void report_free_page(struct work_struct *work) {
> > > > > > +    struct virtio_balloon *vb;
> > > > > > +
> > > > > > +    vb = container_of(work, struct virtio_balloon,
> > > > > > report_free_page_work);
> > > > > > +    report_free_page_cmd_id(vb);
> > > > > > +    walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > > > > > +    /*
> > > > > > +     * The last few free page blocks that were added may not reach the
> > > > > > +     * batch size, but need a kick to notify the device to
> > > > > > handle them.
> > > > > > +     */
> > > > > > +    virtqueue_kick(vb->free_page_vq);
> > > > > > +    report_free_page_end(vb);
> > > > > > +}
> > > > > > +
> > > > > I think there's an issue here: if pages are poisoned and
> > > > > hypervisor subsequently drops them, testing them after allocation
> > > > > will trigger a false positive.
> > > > >
> > > > > The specific configuration:
> > > > >
> > > > > PAGE_POISONING on
> > > > > PAGE_POISONING_NO_SANITY off
> > > > > PAGE_POISONING_ZERO off
> > > > >
> > > > >
> > > > > Solutions:
> > > > > 1. disable the feature in that configuration
> > > > >     suggested as an initial step
> > > >
> > > > Thanks for the finding.
> > > > Similar to this option: I'm thinking could we make
> > > > walk_free_mem_block() simply return if that option is on?
> > > > That is, at the beginning of the function:
> > > >     if (!page_poisoning_enabled())
> > > >                 return;
> > > >
> > >
> > >
> > > Thought about it more, I think it would be better to put this logic to
> > > virtio_balloon:
> > >
> > >         send_free_page_cmd_id(vb, &vb->start_cmd_id);
> > >         if (page_poisoning_enabled() &&
> > >             !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
> > >                 walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > >         send_free_page_cmd_id(vb, &vb->stop_cmd_id);
> > >
> > >
> > > walk_free_mem_block() should be a more generic API, and this potential
> > > page poisoning issue is specific to live migration which is only one
> > > use case of this function, so I think it is better to handle it in the
> > > special use case itself.
> > >
> > > Best,
> > > Wei
> > >
> > 
> > It's a quick work-around but it doesn't make me very happy.
> > 
> > AFAIK e.g. RHEL has a debug kernel with poisoning enabled.
> > If this never uses free page hinting at all, it will be much less useful for
> > debugging guests.
> > 
> 
> I understand your concern. I think people who use debugging guests
> don't regard performance as the first priority, and most vendors
> usually wouldn't use debugging guests for their products.

And when one of these crashes, but only after migration, what do you do? A
very common step for Red Hat support is to ask people to try
reproducing with a debug build.

IOW, being able to debug guests is important; if a debugging guest takes
a significantly different path from the non-debug one, we have a problem.

> 
> How about taking it as the initial solution? We can explore more
> solutions after this series is done.
> 
> Best,
> Wei

I think it's fine as a separate patch.

-- 
MST


* Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-11-13 17:32     ` Michael S. Tsirkin
  2017-11-14 12:02       ` Wei Wang
@ 2017-11-20 11:42       ` Wei Wang
  1 sibling, 0 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-20 11:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu, Nitesh Narayan Lal,
	Rik van Riel

On 11/14/2017 01:32 AM, Michael S. Tsirkin wrote:
> You should Cc Nitesh who is working on a related feature.

OK, I will. We have two more issues which haven't been discussed yet;
please take a look below.

>
> On Mon, Nov 13, 2017 at 06:34:48PM +0800, Wei Wang wrote:
>> Ping for comments, thanks.
>>
>> On 11/03/2017 04:13 PM, Wei Wang wrote:
>>> +static void virtballoon_cmd_report_free_page_start(struct virtio_balloon *vb)
>>> +{
>>> +	unsigned long flags;
>>> +
>>> +	vb->report_free_page_stop = false;
> this flag is used a lot outside any locks. Why is this safe?
> Please add some comments explaining access to this flag.

I will invert the logic as suggested: vb->report_free_page. I also plan
to simplify its usage as below.

The flag is set or cleared in the config handler according to the
new_cmd_id given by the host:

new_cmd_id == 0:          WRITE_ONCE(vb->report_free_page, false); /* stop reporting */
new_cmd_id != old_cmd_id: WRITE_ONCE(vb->report_free_page, true);  /* start reporting */

The flag is read by virtio_balloon_send_free_pages() - the callback to
report free pages:

	if (!READ_ONCE(vb->report_free_page))
		return false;

I don't see how it could be unsafe then (the flag is written by the
config handler only).
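Modeled as stand-alone C (struct and helper names are illustrative only, not the actual patch code), the single-writer protocol behaves like this:

```c
#include <stdbool.h>

/*
 * Illustrative model of the flag handling: the config handler is the
 * only writer, the reporting callback only reads. In the real driver
 * the accesses would use WRITE_ONCE()/READ_ONCE().
 */
struct balloon {
	unsigned int cmd_id;     /* last start command id seen */
	bool report_free_page;
};

static void config_handler(struct balloon *vb, unsigned int new_cmd_id)
{
	if (new_cmd_id == 0) {
		vb->report_free_page = false;   /* host asked us to stop */
	} else if (new_cmd_id != vb->cmd_id) {
		vb->cmd_id = new_cmd_id;
		vb->report_free_page = true;    /* start a new report round */
	}
}

/* The free-page callback bails out as soon as the flag is cleared. */
static bool keep_reporting(const struct balloon *vb)
{
	return vb->report_free_page;
}
```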



>
>>> +}
>>> +
>>>    static inline s64 towards_target(struct virtio_balloon *vb)
>>>    {
>>>    	s64 target;
>>> @@ -597,42 +673,147 @@ static void update_balloon_size_func(struct work_struct *work)
>>>    		queue_work(system_freezable_wq, work);
>>>    }
>>> -static int init_vqs(struct virtio_balloon *vb)
>>> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>>> +					   unsigned long nr_pages)
>>>    {
>>> -	struct virtqueue *vqs[3];
>>> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
>>> -	static const char * const names[] = { "inflate", "deflate", "stats" };
>>> -	int err, nvqs;
>>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>>> +	void *addr = (void *)pfn_to_kaddr(pfn);
> How do we know all free pages have a kaddr?

For x86_64 it works well, since the kernel already has all of physical
memory mapped. But for a 32-bit kernel, yes, high memory usually isn't
mapped and thus has no kaddr. Essentially, this pfn_to_kaddr conversion
isn't necessary; we do it here because the current virtio API is based
on "struct scatterlist", which takes a kaddr, and that kaddr is then
converted back to a physical address in virtqueue_add() when it is
assigned to desc->addr.

I think a better solution would be to add a new API which directly
assigns the caller's guest physical address to desc->addr, similar to
the previous implementation "add_one_chunk()"
(https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg02452.html).
But we can change that to a general virtio API:

	virtqueue_add_one_desc(struct virtqueue *_vq, u64 base_addr,
			       u32 size, bool in_desc, void *data);
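The underlying point - that a pfn already determines the guest-physical address desc->addr needs, with no virtual mapping in between - is plain shift arithmetic (sketch; 4 KiB pages assumed for illustration):

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages assumed */

/*
 * A descriptor that takes a physical address can be filled straight
 * from the pfn, with no pfn_to_kaddr() round trip through a kernel
 * virtual mapping - which is exactly what high memory lacks on 32-bit.
 */
static uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn << PAGE_SHIFT;
}
```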

What do you think?

Best,
Wei



* Re: [PATCH v17 2/6] radix tree test suite: add tests for xbitmap
  2017-11-06 17:00   ` Matthew Wilcox
@ 2017-11-29 14:20     ` Wei Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Wang @ 2017-11-29 14:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, penguin-kernel,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource, yang.zhang.wz, quan.xu

On 11/07/2017 01:00 AM, Matthew Wilcox wrote:
> On Fri, Nov 03, 2017 at 04:13:02PM +0800, Wei Wang wrote:
>> From: Matthew Wilcox <mawilcox@microsoft.com>
>>
>> Add the following tests for xbitmap:
>> 1) single bit test: single bit set/clear/find;
>> 2) bit range test: set/clear a range of bits and find a 0 or 1 bit in
>> the range.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>   tools/include/linux/bitmap.h            |  34 ++++
>>   tools/include/linux/kernel.h            |   2 +
>>   tools/testing/radix-tree/Makefile       |   7 +-
>>   tools/testing/radix-tree/linux/kernel.h |   2 -
>>   tools/testing/radix-tree/main.c         |   5 +
>>   tools/testing/radix-tree/test.h         |   1 +
>>   tools/testing/radix-tree/xbitmap.c      | 278 ++++++++++++++++++++++++++++++++
> Umm.  No.  You've duplicated xbitmap.c into the test framework, so now it can
> slowly get out of sync with the one in lib/.  That's not OK.
>
> Put it back the way it was, with the patch I gave you as patch 1/n
> (relocating xbitmap.c from tools/testing/radix-tree to lib/).
> Then add your enhancements as patch 2/n.  All you should need to
> change in your 1/n from
> http://git.infradead.org/users/willy/linux-dax.git/commit/727e401bee5ad7d37e0077291d90cc17475c6392
> is a bit of Makefile tooling.  Leave the test suite embedded in the file;
> that way people might remember to update the test suite when adding
> new functionality.
>

Thanks for your suggestions. Please check the v18 patches:
the original implementation is in patch 4, the proposed changes are
separated into patch 5 (we can merge them into patch 4 later if they
look good to you), and the new APIs are in patch 6.

Best,
Wei


end of thread, other threads:[~2017-11-29 14:18 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-03  8:13 [PATCH v17 0/6] Virtio-balloon Enhancement Wei Wang
2017-11-03  8:13 ` [PATCH v17 1/6] lib/xbitmap: Introduce xbitmap Wei Wang
2017-11-03 10:55   ` Tetsuo Handa
2017-11-06  8:15     ` Wei Wang
2017-11-03  8:13 ` [PATCH v17 2/6] radix tree test suite: add tests for xbitmap Wei Wang
2017-11-06 17:00   ` Matthew Wilcox
2017-11-29 14:20     ` Wei Wang
2017-11-03  8:13 ` [PATCH v17 3/6] mm/balloon_compaction.c: split balloon page allocation and enqueue Wei Wang
2017-11-03 10:59   ` Tetsuo Handa
2017-11-03  8:13 ` [PATCH v17 4/6] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
2017-11-03 11:25   ` Tetsuo Handa
2017-11-04 11:09     ` Wei Wang
2017-11-04 11:28       ` Tetsuo Handa
2017-11-06  8:21         ` Wei Wang
2017-11-03  8:13 ` [PATCH v17 5/6] mm: support reporting free page blocks Wei Wang
2017-11-03  8:13 ` [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
2017-11-13 10:34   ` Wei Wang
2017-11-13 17:32     ` Michael S. Tsirkin
2017-11-14 12:02       ` Wei Wang
2017-11-14 21:21         ` Michael S. Tsirkin
2017-11-15  3:47           ` Wei Wang
2017-11-15 13:26             ` Michael S. Tsirkin
2017-11-16 11:59               ` Wei Wang
2017-11-20 11:42       ` Wei Wang
2017-11-15 20:32   ` Michael S. Tsirkin
2017-11-16 13:27     ` [virtio-dev] " Wei Wang
2017-11-17 11:35       ` Wei Wang
2017-11-17 11:48         ` Wei Wang
2017-11-17 12:44         ` Michael S. Tsirkin
2017-11-18  5:22           ` Wang, Wei W
2017-11-19 15:11             ` Michael S. Tsirkin
2017-11-17 13:18       ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).