All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v14 0/5] Virtio-balloon Enhancement
@ 2017-08-17  3:26 ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one by one; and
2) free_page_vq: a new virtqueue to report guest free pages to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

The second feature  enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It is not concerned that the memory pages are used after they
are given to the hypervisor as a hint of the free pages, because they will
be tracked by the hypervisor and transferred in the next round if they are
used and written.

Change Log:
v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page blcok to the device, if the
vq is full (less likey to happen in practice), just skip reporting this
block, instead of busywaiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the the free page blocks from the
report function. This avoids exposing the zone internal to a kernel module.
2) virtio-balloon: send balloon pages or a free page block using a single sg
each time. This has the benefits of simpler implementation with no new APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input sgs,
and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro
to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages blocks
directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3)  virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (1):
  lib/xbitmap: Introduce xbitmap

Wei Wang (4):
  lib/xbitmap: add xb_find_next_bit() and xb_zero()
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

 drivers/virtio/virtio_balloon.c     | 324 +++++++++++++++++++++++++++++++-----
 include/linux/mm.h                  |   6 +
 include/linux/radix-tree.h          |   3 +
 include/linux/xbitmap.h             |  64 +++++++
 include/uapi/linux/virtio_balloon.h |   2 +
 lib/Makefile                        |   2 +-
 lib/radix-tree.c                    |  22 ++-
 lib/xbitmap.c                       | 215 ++++++++++++++++++++++++
 mm/page_alloc.c                     |  44 +++++
 9 files changed, 640 insertions(+), 42 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v14 0/5] Virtio-balloon Enhancement
@ 2017-08-17  3:26 ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one by one; and
2) free_page_vq: a new virtqueue to report guest free pages to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

The second feature  enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It is not concerned that the memory pages are used after they
are given to the hypervisor as a hint of the free pages, because they will
be tracked by the hypervisor and transferred in the next round if they are
used and written.

Change Log:
v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page blcok to the device, if the
vq is full (less likey to happen in practice), just skip reporting this
block, instead of busywaiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the the free page blocks from the
report function. This avoids exposing the zone internal to a kernel module.
2) virtio-balloon: send balloon pages or a free page block using a single sg
each time. This has the benefits of simpler implementation with no new APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input sgs,
and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro
to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages blocks
directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3)  virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (1):
  lib/xbitmap: Introduce xbitmap

Wei Wang (4):
  lib/xbitmap: add xb_find_next_bit() and xb_zero()
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

 drivers/virtio/virtio_balloon.c     | 324 +++++++++++++++++++++++++++++++-----
 include/linux/mm.h                  |   6 +
 include/linux/radix-tree.h          |   3 +
 include/linux/xbitmap.h             |  64 +++++++
 include/uapi/linux/virtio_balloon.h |   2 +
 lib/Makefile                        |   2 +-
 lib/radix-tree.c                    |  22 ++-
 lib/xbitmap.c                       | 215 ++++++++++++++++++++++++
 mm/page_alloc.c                     |  44 +++++
 9 files changed, 640 insertions(+), 42 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [Qemu-devel] [PATCH v14 0/5] Virtio-balloon Enhancement
@ 2017-08-17  3:26 ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one by one; and
2) free_page_vq: a new virtqueue to report guest free pages to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

The second feature  enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It is not concerned that the memory pages are used after they
are given to the hypervisor as a hint of the free pages, because they will
be tracked by the hypervisor and transferred in the next round if they are
used and written.

Change Log:
v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page blcok to the device, if the
vq is full (less likey to happen in practice), just skip reporting this
block, instead of busywaiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the the free page blocks from the
report function. This avoids exposing the zone internal to a kernel module.
2) virtio-balloon: send balloon pages or a free page block using a single sg
each time. This has the benefits of simpler implementation with no new APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input sgs,
and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro
to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages blocks
directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3)  virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (1):
  lib/xbitmap: Introduce xbitmap

Wei Wang (4):
  lib/xbitmap: add xb_find_next_bit() and xb_zero()
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

 drivers/virtio/virtio_balloon.c     | 324 +++++++++++++++++++++++++++++++-----
 include/linux/mm.h                  |   6 +
 include/linux/radix-tree.h          |   3 +
 include/linux/xbitmap.h             |  64 +++++++
 include/uapi/linux/virtio_balloon.h |   2 +
 lib/Makefile                        |   2 +-
 lib/radix-tree.c                    |  22 ++-
 lib/xbitmap.c                       | 215 ++++++++++++++++++++++++
 mm/page_alloc.c                     |  44 +++++
 9 files changed, 640 insertions(+), 42 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] [PATCH v14 0/5] Virtio-balloon Enhancement
@ 2017-08-17  3:26 ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one by one; and
2) free_page_vq: a new virtqueue to report guest free pages to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

The second feature  enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It is not concerned that the memory pages are used after they
are given to the hypervisor as a hint of the free pages, because they will
be tracked by the hypervisor and transferred in the next round if they are
used and written.

Change Log:
v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page blcok to the device, if the
vq is full (less likey to happen in practice), just skip reporting this
block, instead of busywaiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the the free page blocks from the
report function. This avoids exposing the zone internal to a kernel module.
2) virtio-balloon: send balloon pages or a free page block using a single sg
each time. This has the benefits of simpler implementation with no new APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input sgs,
and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro
to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages blocks
directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3)  virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (1):
  lib/xbitmap: Introduce xbitmap

Wei Wang (4):
  lib/xbitmap: add xb_find_next_bit() and xb_zero()
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

 drivers/virtio/virtio_balloon.c     | 324 +++++++++++++++++++++++++++++++-----
 include/linux/mm.h                  |   6 +
 include/linux/radix-tree.h          |   3 +
 include/linux/xbitmap.h             |  64 +++++++
 include/uapi/linux/virtio_balloon.h |   2 +
 lib/Makefile                        |   2 +-
 lib/radix-tree.c                    |  22 ++-
 lib/xbitmap.c                       | 215 ++++++++++++++++++++++++
 mm/page_alloc.c                     |  44 +++++
 9 files changed, 640 insertions(+), 42 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

-- 
2.7.4


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
  2017-08-17  3:26 ` Wei Wang
  (?)
  (?)
@ 2017-08-17  3:26   ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/radix-tree.h |   3 +
 include/linux/xbitmap.h    |  61 ++++++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  22 +++++-
 lib/xbitmap.c              | 176 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..e1203b1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
@@ -325,6 +327,7 @@ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
 unsigned int radix_tree_gang_lookup_slot(const struct radix_tree_root *,
 			void __rcu ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..5edbf84
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,61 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..ee72e2c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
@@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 out:
 	return ret;
 }
+EXPORT_SYMBOL(__radix_tree_preload);
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
@@ -840,6 +841,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +1989,20 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+/**
+ * __radix_tree_delete - delete a slot from a radix tree
+ * @root: radix tree root
+ * @node: node containing the slot
+ * @slot: pointer to the slot to delete
+ *
+ * Clear @slot from @node of the radix tree. This may cause the current node to
+ * be freed. This function may be called without any locking if there are no
+ * other threads which can access this tree.
+ *
+ * Return: the node or NULL if the node is freed.
+ */
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 	replace_slot(slot, NULL, node, -1, exceptional);
 	return node && delete_node(root, node, NULL, NULL);
 }
+EXPORT_SYMBOL(__radix_tree_delete);
 
 /**
  * radix_tree_iter_delete - delete the entry at this iterator position
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..cc766d9
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,176 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding iteam (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+enum xb_ops {
+	XB_SET,
+	XB_CLEAR,
+	XB_TEST
+};
+
+static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
+{
+	int ret = 0;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit, tmp;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
+
+	switch (ops) {
+	case XB_SET:
+		ret = __radix_tree_create(root, index, 0, &node, &slot);
+		if (ret)
+			return ret;
+		bitmap = rcu_dereference_raw(*slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit < BITS_PER_LONG) {
+				tmp |= 1UL << ebit;
+				rcu_assign_pointer(*slot, (void *)tmp);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			bitmap->bitmap[0] =
+					tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+			rcu_assign_pointer(*slot, bitmap);
+		}
+		if (!bitmap) {
+			if (ebit < BITS_PER_LONG) {
+				bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+				__radix_tree_replace(root, node, slot, bitmap,
+						     NULL, NULL);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+					     NULL);
+		}
+		__set_bit(bit, bitmap->bitmap);
+		break;
+	case XB_CLEAR:
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit >= BITS_PER_LONG)
+				return 0;
+			tmp &= ~(1UL << ebit);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		if (!bitmap)
+			return 0;
+		__clear_bit(bit, bitmap->bitmap);
+		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+			kfree(bitmap);
+			__radix_tree_delete(root, node, slot);
+		}
+		break;
+	case XB_TEST:
+		bitmap = radix_tree_lookup(root, index);
+		if (!bitmap)
+			return 0;
+		if (radix_tree_exception(bitmap)) {
+			if (ebit > BITS_PER_LONG)
+				return 0;
+			return (unsigned long)bitmap & (1UL << bit);
+		}
+		ret = test_bit(bit, bitmap->bitmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	return xb_bit_ops(xb, bit, XB_SET);
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	xb_bit_ops(xb, bit, XB_CLEAR);
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: 1 if the bit is set, or 0 otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	return (bool)xb_bit_ops(xb, bit, XB_TEST);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/radix-tree.h |   3 +
 include/linux/xbitmap.h    |  61 ++++++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  22 +++++-
 lib/xbitmap.c              | 176 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..e1203b1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
@@ -325,6 +327,7 @@ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
 unsigned int radix_tree_gang_lookup_slot(const struct radix_tree_root *,
 			void __rcu ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..5edbf84
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,61 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..ee72e2c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
@@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 out:
 	return ret;
 }
+EXPORT_SYMBOL(__radix_tree_preload);
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
@@ -840,6 +841,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +1989,20 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+/**
+ * __radix_tree_delete - delete a slot from a radix tree
+ * @root: radix tree root
+ * @node: node containing the slot
+ * @slot: pointer to the slot to delete
+ *
+ * Clear @slot from @node of the radix tree. This may cause the current node to
+ * be freed. This function may be called without any locking if there are no
+ * other threads which can access this tree.
+ *
+ * Return: the node or NULL if the node is freed.
+ */
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 	replace_slot(slot, NULL, node, -1, exceptional);
 	return node && delete_node(root, node, NULL, NULL);
 }
+EXPORT_SYMBOL(__radix_tree_delete);
 
 /**
  * radix_tree_iter_delete - delete the entry at this iterator position
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..cc766d9
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,176 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding iteam (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+enum xb_ops {
+	XB_SET,
+	XB_CLEAR,
+	XB_TEST
+};
+
+static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
+{
+	int ret = 0;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit, tmp;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
+
+	switch (ops) {
+	case XB_SET:
+		ret = __radix_tree_create(root, index, 0, &node, &slot);
+		if (ret)
+			return ret;
+		bitmap = rcu_dereference_raw(*slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit < BITS_PER_LONG) {
+				tmp |= 1UL << ebit;
+				rcu_assign_pointer(*slot, (void *)tmp);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			bitmap->bitmap[0] =
+					tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+			rcu_assign_pointer(*slot, bitmap);
+		}
+		if (!bitmap) {
+			if (ebit < BITS_PER_LONG) {
+				bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+				__radix_tree_replace(root, node, slot, bitmap,
+						     NULL, NULL);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+					     NULL);
+		}
+		__set_bit(bit, bitmap->bitmap);
+		break;
+	case XB_CLEAR:
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit >= BITS_PER_LONG)
+				return 0;
+			tmp &= ~(1UL << ebit);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		if (!bitmap)
+			return 0;
+		__clear_bit(bit, bitmap->bitmap);
+		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+			kfree(bitmap);
+			__radix_tree_delete(root, node, slot);
+		}
+		break;
+	case XB_TEST:
+		bitmap = radix_tree_lookup(root, index);
+		if (!bitmap)
+			return 0;
+		if (radix_tree_exception(bitmap)) {
+			if (ebit > BITS_PER_LONG)
+				return 0;
+			return (unsigned long)bitmap & (1UL << bit);
+		}
+		ret = test_bit(bit, bitmap->bitmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	return xb_bit_ops(xb, bit, XB_SET);
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	xb_bit_ops(xb, bit, XB_CLEAR);
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: 1 if the bit is set, or 0 otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	return (bool)xb_bit_ops(xb, bit, XB_TEST);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [Qemu-devel] [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/radix-tree.h |   3 +
 include/linux/xbitmap.h    |  61 ++++++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  22 +++++-
 lib/xbitmap.c              | 176 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..e1203b1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
@@ -325,6 +327,7 @@ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
 unsigned int radix_tree_gang_lookup_slot(const struct radix_tree_root *,
 			void __rcu ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..5edbf84
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,61 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..ee72e2c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
@@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 out:
 	return ret;
 }
+EXPORT_SYMBOL(__radix_tree_preload);
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
@@ -840,6 +841,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +1989,20 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+/**
+ * __radix_tree_delete - delete a slot from a radix tree
+ * @root: radix tree root
+ * @node: node containing the slot
+ * @slot: pointer to the slot to delete
+ *
+ * Clear @slot from @node of the radix tree. This may cause the current node to
+ * be freed. This function may be called without any locking if there are no
+ * other threads which can access this tree.
+ *
+ * Return: the node or NULL if the node is freed.
+ */
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 	replace_slot(slot, NULL, node, -1, exceptional);
 	return node && delete_node(root, node, NULL, NULL);
 }
+EXPORT_SYMBOL(__radix_tree_delete);
 
 /**
  * radix_tree_iter_delete - delete the entry at this iterator position
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..cc766d9
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,176 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding iteam (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+enum xb_ops {
+	XB_SET,
+	XB_CLEAR,
+	XB_TEST
+};
+
+static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
+{
+	int ret = 0;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit, tmp;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
+
+	switch (ops) {
+	case XB_SET:
+		ret = __radix_tree_create(root, index, 0, &node, &slot);
+		if (ret)
+			return ret;
+		bitmap = rcu_dereference_raw(*slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit < BITS_PER_LONG) {
+				tmp |= 1UL << ebit;
+				rcu_assign_pointer(*slot, (void *)tmp);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			bitmap->bitmap[0] =
+					tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+			rcu_assign_pointer(*slot, bitmap);
+		}
+		if (!bitmap) {
+			if (ebit < BITS_PER_LONG) {
+				bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+				__radix_tree_replace(root, node, slot, bitmap,
+						     NULL, NULL);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+					     NULL);
+		}
+		__set_bit(bit, bitmap->bitmap);
+		break;
+	case XB_CLEAR:
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit >= BITS_PER_LONG)
+				return 0;
+			tmp &= ~(1UL << ebit);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		if (!bitmap)
+			return 0;
+		__clear_bit(bit, bitmap->bitmap);
+		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+			kfree(bitmap);
+			__radix_tree_delete(root, node, slot);
+		}
+		break;
+	case XB_TEST:
+		bitmap = radix_tree_lookup(root, index);
+		if (!bitmap)
+			return 0;
+		if (radix_tree_exception(bitmap)) {
+			if (ebit > BITS_PER_LONG)
+				return 0;
+			return (unsigned long)bitmap & (1UL << bit);
+		}
+		ret = test_bit(bit, bitmap->bitmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	return xb_bit_ops(xb, bit, XB_SET);
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	xb_bit_ops(xb, bit, XB_CLEAR);
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: 1 if the bit is set, or 0 otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	return (bool)xb_bit_ops(xb, bit, XB_TEST);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
  2017-08-17  3:26 ` Wei Wang
                   ` (2 preceding siblings ...)
  (?)
@ 2017-08-17  3:26 ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/radix-tree.h |   3 +
 include/linux/xbitmap.h    |  61 ++++++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  22 +++++-
 lib/xbitmap.c              | 176 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..e1203b1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
@@ -325,6 +327,7 @@ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
 unsigned int radix_tree_gang_lookup_slot(const struct radix_tree_root *,
 			void __rcu ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..5edbf84
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,61 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..ee72e2c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
@@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 out:
 	return ret;
 }
+EXPORT_SYMBOL(__radix_tree_preload);
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
@@ -840,6 +841,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +1989,20 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+/**
+ * __radix_tree_delete - delete a slot from a radix tree
+ * @root: radix tree root
+ * @node: node containing the slot
+ * @slot: pointer to the slot to delete
+ *
+ * Clear @slot from @node of the radix tree. This may cause the current node to
+ * be freed. This function may be called without any locking if there are no
+ * other threads which can access this tree.
+ *
+ * Return: the node or NULL if the node is freed.
+ */
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 	replace_slot(slot, NULL, node, -1, exceptional);
 	return node && delete_node(root, node, NULL, NULL);
 }
+EXPORT_SYMBOL(__radix_tree_delete);
 
 /**
  * radix_tree_iter_delete - delete the entry at this iterator position
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..cc766d9
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,176 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding iteam (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+enum xb_ops {
+	XB_SET,
+	XB_CLEAR,
+	XB_TEST
+};
+
+static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
+{
+	int ret = 0;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit, tmp;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
+
+	switch (ops) {
+	case XB_SET:
+		ret = __radix_tree_create(root, index, 0, &node, &slot);
+		if (ret)
+			return ret;
+		bitmap = rcu_dereference_raw(*slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit < BITS_PER_LONG) {
+				tmp |= 1UL << ebit;
+				rcu_assign_pointer(*slot, (void *)tmp);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			bitmap->bitmap[0] =
+					tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+			rcu_assign_pointer(*slot, bitmap);
+		}
+		if (!bitmap) {
+			if (ebit < BITS_PER_LONG) {
+				bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+				__radix_tree_replace(root, node, slot, bitmap,
+						     NULL, NULL);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+					     NULL);
+		}
+		__set_bit(bit, bitmap->bitmap);
+		break;
+	case XB_CLEAR:
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit >= BITS_PER_LONG)
+				return 0;
+			tmp &= ~(1UL << ebit);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		if (!bitmap)
+			return 0;
+		__clear_bit(bit, bitmap->bitmap);
+		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+			kfree(bitmap);
+			__radix_tree_delete(root, node, slot);
+		}
+		break;
+	case XB_TEST:
+		bitmap = radix_tree_lookup(root, index);
+		if (!bitmap)
+			return 0;
+		if (radix_tree_exception(bitmap)) {
+			if (ebit > BITS_PER_LONG)
+				return 0;
+			return (unsigned long)bitmap & (1UL << bit);
+		}
+		ret = test_bit(bit, bitmap->bitmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	return xb_bit_ops(xb, bit, XB_SET);
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	xb_bit_ops(xb, bit, XB_CLEAR);
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: 1 if the bit is set, or 0 otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	return (bool)xb_bit_ops(xb, bit, XB_TEST);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [virtio-dev] [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/radix-tree.h |   3 +
 include/linux/xbitmap.h    |  61 ++++++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  22 +++++-
 lib/xbitmap.c              | 176 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..e1203b1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
@@ -325,6 +327,7 @@ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
 unsigned int radix_tree_gang_lookup_slot(const struct radix_tree_root *,
 			void __rcu ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..5edbf84
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,61 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..ee72e2c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+int __radix_tree_preload(gfp_t gfp_mask, unsigned int nr)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
@@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 out:
 	return ret;
 }
+EXPORT_SYMBOL(__radix_tree_preload);
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
@@ -840,6 +841,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +1989,20 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+/**
+ * __radix_tree_delete - delete a slot from a radix tree
+ * @root: radix tree root
+ * @node: node containing the slot
+ * @slot: pointer to the slot to delete
+ *
+ * Clear @slot from @node of the radix tree. This may cause the current node to
+ * be freed. This function may be called without any locking if there are no
+ * other threads which can access this tree.
+ *
+ * Return: the node or NULL if the node is freed.
+ */
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 	replace_slot(slot, NULL, node, -1, exceptional);
 	return node && delete_node(root, node, NULL, NULL);
 }
+EXPORT_SYMBOL(__radix_tree_delete);
 
 /**
  * radix_tree_iter_delete - delete the entry at this iterator position
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..cc766d9
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,176 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding iteam (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+enum xb_ops {
+	XB_SET,
+	XB_CLEAR,
+	XB_TEST
+};
+
+static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
+{
+	int ret = 0;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit, tmp;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
+
+	switch (ops) {
+	case XB_SET:
+		ret = __radix_tree_create(root, index, 0, &node, &slot);
+		if (ret)
+			return ret;
+		bitmap = rcu_dereference_raw(*slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit < BITS_PER_LONG) {
+				tmp |= 1UL << ebit;
+				rcu_assign_pointer(*slot, (void *)tmp);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			bitmap->bitmap[0] =
+					tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+			rcu_assign_pointer(*slot, bitmap);
+		}
+		if (!bitmap) {
+			if (ebit < BITS_PER_LONG) {
+				bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+				__radix_tree_replace(root, node, slot, bitmap,
+						     NULL, NULL);
+				return 0;
+			}
+			bitmap = this_cpu_xchg(ida_bitmap, NULL);
+			if (!bitmap)
+				return -EAGAIN;
+			memset(bitmap, 0, sizeof(*bitmap));
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+					     NULL);
+		}
+		__set_bit(bit, bitmap->bitmap);
+		break;
+	case XB_CLEAR:
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			tmp = (unsigned long)bitmap;
+			if (ebit >= BITS_PER_LONG)
+				return 0;
+			tmp &= ~(1UL << ebit);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		if (!bitmap)
+			return 0;
+		__clear_bit(bit, bitmap->bitmap);
+		if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+			kfree(bitmap);
+			__radix_tree_delete(root, node, slot);
+		}
+		break;
+	case XB_TEST:
+		bitmap = radix_tree_lookup(root, index);
+		if (!bitmap)
+			return 0;
+		if (radix_tree_exception(bitmap)) {
+			if (ebit > BITS_PER_LONG)
+				return 0;
+			return (unsigned long)bitmap & (1UL << bit);
+		}
+		ret = test_bit(bit, bitmap->bitmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	return xb_bit_ops(xb, bit, XB_SET);
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	xb_bit_ops(xb, bit, XB_CLEAR);
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: 1 if the bit is set, or 0 otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+	return (bool)xb_bit_ops(xb, bit, XB_TEST);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+/**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	__radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
-- 
2.7.4


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
  2017-08-17  3:26 ` Wei Wang
  (?)
  (?)
@ 2017-08-17  3:26   ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

xb_find_next_bit() is used to find the next "1" or "0" bit in the
given range. xb_zero() is used to zero the given range of bits.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/xbitmap.h |  3 +++
 lib/xbitmap.c           | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 5edbf84..739d08c 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -38,6 +38,9 @@ static inline void xb_init(struct xb *xb)
 int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end);
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set);
 
 /* Check if the xb tree is empty */
 static inline bool xb_is_empty(const struct xb *xb)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index cc766d9..2267ac2 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -174,3 +174,42 @@ void xb_preload(gfp_t gfp)
 	}
 }
 EXPORT_SYMBOL(xb_preload);
+
+/**
+ *  xb_zero - zero a range of bits in the xbitmap
+ *  @xb: the xbitmap that the bits reside in
+ *  @start: the start of the range, inclusive
+ *  @end: the end of the range, inclusive
+ */
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++)
+		xb_clear_bit(xb, i);
+}
+EXPORT_SYMBOL(xb_zero);
+
+/**
+ * xb_find_next_bit - find next 1 or 0 in the give range of bits
+ * @xb: the xbitmap that the bits reside in
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ * @set: the polarity (1 or 0) of the next bit to find
+ *
+ * Return the index of the found bit in the xbitmap. If the returned index
+ * exceeds @end, it indicates that no such bit is found in the given range.
+ */
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++) {
+		if (xb_test_bit(xb, i) == set)
+			break;
+	}
+
+	return i;
+}
+EXPORT_SYMBOL(xb_find_next_bit);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

xb_find_next_bit() is used to find the next "1" or "0" bit in the
given range. xb_zero() is used to zero the given range of bits.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/xbitmap.h |  3 +++
 lib/xbitmap.c           | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 5edbf84..739d08c 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -38,6 +38,9 @@ static inline void xb_init(struct xb *xb)
 int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end);
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set);
 
 /* Check if the xb tree is empty */
 static inline bool xb_is_empty(const struct xb *xb)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index cc766d9..2267ac2 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -174,3 +174,42 @@ void xb_preload(gfp_t gfp)
 	}
 }
 EXPORT_SYMBOL(xb_preload);
+
+/**
+ *  xb_zero - zero a range of bits in the xbitmap
+ *  @xb: the xbitmap that the bits reside in
+ *  @start: the start of the range, inclusive
+ *  @end: the end of the range, inclusive
+ */
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++)
+		xb_clear_bit(xb, i);
+}
+EXPORT_SYMBOL(xb_zero);
+
+/**
+ * xb_find_next_bit - find next 1 or 0 in the give range of bits
+ * @xb: the xbitmap that the bits reside in
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ * @set: the polarity (1 or 0) of the next bit to find
+ *
+ * Return the index of the found bit in the xbitmap. If the returned index
+ * exceeds @end, it indicates that no such bit is found in the given range.
+ */
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++) {
+		if (xb_test_bit(xb, i) == set)
+			break;
+	}
+
+	return i;
+}
+EXPORT_SYMBOL(xb_find_next_bit);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [Qemu-devel] [PATCH v14 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

xb_find_next_bit() is used to find the next "1" or "0" bit in the
given range. xb_zero() is used to zero the given range of bits.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/xbitmap.h |  3 +++
 lib/xbitmap.c           | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 5edbf84..739d08c 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -38,6 +38,9 @@ static inline void xb_init(struct xb *xb)
 int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end);
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set);
 
 /* Check if the xb tree is empty */
 static inline bool xb_is_empty(const struct xb *xb)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index cc766d9..2267ac2 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -174,3 +174,42 @@ void xb_preload(gfp_t gfp)
 	}
 }
 EXPORT_SYMBOL(xb_preload);
+
+/**
+ *  xb_zero - zero a range of bits in the xbitmap
+ *  @xb: the xbitmap that the bits reside in
+ *  @start: the start of the range, inclusive
+ *  @end: the end of the range, inclusive
+ */
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++)
+		xb_clear_bit(xb, i);
+}
+EXPORT_SYMBOL(xb_zero);
+
+/**
+ * xb_find_next_bit - find next 1 or 0 in the give range of bits
+ * @xb: the xbitmap that the bits reside in
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ * @set: the polarity (1 or 0) of the next bit to find
+ *
+ * Return the index of the found bit in the xbitmap. If the returned index
+ * exceeds @end, it indicates that no such bit is found in the given range.
+ */
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++) {
+		if (xb_test_bit(xb, i) == set)
+			break;
+	}
+
+	return i;
+}
+EXPORT_SYMBOL(xb_find_next_bit);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
  2017-08-17  3:26 ` Wei Wang
                   ` (4 preceding siblings ...)
  (?)
@ 2017-08-17  3:26 ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

xb_find_next_bit() is used to find the next "1" or "0" bit in the
given range. xb_zero() is used to zero the given range of bits.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/xbitmap.h |  3 +++
 lib/xbitmap.c           | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 5edbf84..739d08c 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -38,6 +38,9 @@ static inline void xb_init(struct xb *xb)
 int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end);
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set);
 
 /* Check if the xb tree is empty */
 static inline bool xb_is_empty(const struct xb *xb)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index cc766d9..2267ac2 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -174,3 +174,42 @@ void xb_preload(gfp_t gfp)
 	}
 }
 EXPORT_SYMBOL(xb_preload);
+
+/**
+ *  xb_zero - zero a range of bits in the xbitmap
+ *  @xb: the xbitmap that the bits reside in
+ *  @start: the start of the range, inclusive
+ *  @end: the end of the range, inclusive
+ */
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++)
+		xb_clear_bit(xb, i);
+}
+EXPORT_SYMBOL(xb_zero);
+
+/**
+ * xb_find_next_bit - find next 1 or 0 in the give range of bits
+ * @xb: the xbitmap that the bits reside in
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ * @set: the polarity (1 or 0) of the next bit to find
+ *
+ * Return the index of the found bit in the xbitmap. If the returned index
+ * exceeds @end, it indicates that no such bit is found in the given range.
+ */
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++) {
+		if (xb_test_bit(xb, i) == set)
+			break;
+	}
+
+	return i;
+}
+EXPORT_SYMBOL(xb_find_next_bit);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [virtio-dev] [PATCH v14 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

xb_find_next_bit() is used to find the next "1" or "0" bit in the
given range. xb_zero() is used to zero the given range of bits.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/xbitmap.h |  3 +++
 lib/xbitmap.c           | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index 5edbf84..739d08c 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -38,6 +38,9 @@ static inline void xb_init(struct xb *xb)
 int xb_set_bit(struct xb *xb, unsigned long bit);
 bool xb_test_bit(const struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end);
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set);
 
 /* Check if the xb tree is empty */
 static inline bool xb_is_empty(const struct xb *xb)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index cc766d9..2267ac2 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -174,3 +174,42 @@ void xb_preload(gfp_t gfp)
 	}
 }
 EXPORT_SYMBOL(xb_preload);
+
+/**
+ *  xb_zero - zero a range of bits in the xbitmap
+ *  @xb: the xbitmap that the bits reside in
+ *  @start: the start of the range, inclusive
+ *  @end: the end of the range, inclusive
+ */
+void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++)
+		xb_clear_bit(xb, i);
+}
+EXPORT_SYMBOL(xb_zero);
+
+/**
+ * xb_find_next_bit - find next 1 or 0 in the give range of bits
+ * @xb: the xbitmap that the bits reside in
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ * @set: the polarity (1 or 0) of the next bit to find
+ *
+ * Return the index of the found bit in the xbitmap. If the returned index
+ * exceeds @end, it indicates that no such bit is found in the given range.
+ */
+unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+			       unsigned long end, bool set)
+{
+	unsigned long i;
+
+	for (i = start; i <= end; i++) {
+		if (xb_test_bit(xb, i) == set)
+			break;
+	}
+
+	return i;
+}
+EXPORT_SYMBOL(xb_find_next_bit);
-- 
2.7.4


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-17  3:26 ` Wei Wang
  (?)
  (?)
@ 2017-08-17  3:26   ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically continuous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~541ms
resulting in an improvement of ~87%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 141 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..72041b4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +80,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record ballooned pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, addr, size);
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static void send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size)
+{
+	unsigned int len;
+	int ret;
+
+	do {
+		ret = add_one_sg(vq, addr, size);
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		/*
+		 * It is uncommon to see the vq is full, because the sg is sent
+		 * one by one and the device is able to handle it in time. But
+		 * if that happens, we go back to retry after an entry gets
+		 * released.
+		 */
+	} while (unlikely(ret == -ENOSPC));
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+						page_xb_end, 1);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+					      page_xb_end, 0);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
+		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +791,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically continuous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~541ms
resulting in an improvement of ~87%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 141 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..72041b4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +80,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record ballooned pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, addr, size);
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static void send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size)
+{
+	unsigned int len;
+	int ret;
+
+	do {
+		ret = add_one_sg(vq, addr, size);
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		/*
+		 * It is uncommon to see the vq is full, because the sg is sent
+		 * one by one and the device is able to handle it in time. But
+		 * if that happens, we go back to retry after an entry gets
+		 * released.
+		 */
+	} while (unlikely(ret == -ENOSPC));
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+						page_xb_end, 1);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+					      page_xb_end, 0);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
+		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +791,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [Qemu-devel] [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically continuous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~541ms
resulting in an improvement of ~87%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 141 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..72041b4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +80,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record ballooned pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, addr, size);
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static void send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size)
+{
+	unsigned int len;
+	int ret;
+
+	do {
+		ret = add_one_sg(vq, addr, size);
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		/*
+		 * It is uncommon to see the vq is full, because the sg is sent
+		 * one by one and the device is able to handle it in time. But
+		 * if that happens, we go back to retry after an entry gets
+		 * released.
+		 */
+	} while (unlikely(ret == -ENOSPC));
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+						page_xb_end, 1);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+					      page_xb_end, 0);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
+		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +791,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-17  3:26 ` Wei Wang
                   ` (6 preceding siblings ...)
  (?)
@ 2017-08-17  3:26 ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically continuous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~541ms
resulting in an improvement of ~87%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 141 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..72041b4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +80,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record ballooned pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, addr, size);
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static void send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size)
+{
+	unsigned int len;
+	int ret;
+
+	do {
+		ret = add_one_sg(vq, addr, size);
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		/*
+		 * It is uncommon to see the vq is full, because the sg is sent
+		 * one by one and the device is able to handle it in time. But
+		 * if that happens, we go back to retry after an entry gets
+		 * released.
+		 */
+	} while (unlikely(ret == -ENOSPC));
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+						page_xb_end, 1);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+					      page_xb_end, 0);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
+		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +791,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [virtio-dev] [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically continuous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~541ms
resulting in an improvement of ~87%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 141 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..72041b4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +80,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record ballooned pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+
+	sg_init_one(&sg, addr, size);
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static void send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size)
+{
+	unsigned int len;
+	int ret;
+
+	do {
+		ret = add_one_sg(vq, addr, size);
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		/*
+		 * It is uncommon to see the vq is full, because the sg is sent
+		 * one by one and the device is able to handle it in time. But
+		 * if that happens, we go back to retry after an entry gets
+		 * released.
+		 */
+	} while (unlikely(ret == -ENOSPC));
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
+						page_xb_end, 1);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
+					      page_xb_end, 0);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
+		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +791,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-17  3:26 ` Wei Wang
  (?)
  (?)
@ 2017-08-17  3:26   ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  6 ++++++
 mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..cd29b9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque1,
+				unsigned int min_order,
+				void (*visit)(void *opaque2,
+					      unsigned long pfn,
+					      unsigned long nr_pages));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..a721a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque1: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @visit: the callback function given by the caller
+ *
+ * The function is used to walk through the free page blocks in the system,
+ * and each free page block is reported to the caller via the @visit callback.
+ * Please note:
+ * 1) The function is used to report hints of free pages, so the caller should
+ * not use those reported pages after the callback returns.
+ * 2) The callback is invoked with the zone->lock being held, so it should not
+ * block and should finish as soon as possible.
+ */
+void walk_free_mem_block(void *opaque1,
+			 unsigned int min_order,
+			 void (*visit)(void *opaque2,
+				       unsigned long pfn,
+				       unsigned long nr_pages))
+{
+	struct zone *zone;
+	struct page *page;
+	struct list_head *list;
+	unsigned int order;
+	enum migratetype mt;
+	unsigned long pfn, flags;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1;
+		     order < MAX_ORDER && order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				spin_lock_irqsave(&zone->lock, flags);
+				list = &zone->free_area[order].free_list[mt];
+				list_for_each_entry(page, list, lru) {
+					pfn = page_to_pfn(page);
+					visit(opaque1, pfn, 1 << order);
+				}
+				spin_unlock_irqrestore(&zone->lock, flags);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  6 ++++++
 mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..cd29b9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque1,
+				unsigned int min_order,
+				void (*visit)(void *opaque2,
+					      unsigned long pfn,
+					      unsigned long nr_pages));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..a721a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque1: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @visit: the callback function given by the caller
+ *
+ * The function is used to walk through the free page blocks in the system,
+ * and each free page block is reported to the caller via the @visit callback.
+ * Please note:
+ * 1) The function is used to report hints of free pages, so the caller should
+ * not use those reported pages after the callback returns.
+ * 2) The callback is invoked with the zone->lock being held, so it should not
+ * block and should finish as soon as possible.
+ */
+void walk_free_mem_block(void *opaque1,
+			 unsigned int min_order,
+			 void (*visit)(void *opaque2,
+				       unsigned long pfn,
+				       unsigned long nr_pages))
+{
+	struct zone *zone;
+	struct page *page;
+	struct list_head *list;
+	unsigned int order;
+	enum migratetype mt;
+	unsigned long pfn, flags;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1;
+		     order < MAX_ORDER && order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				spin_lock_irqsave(&zone->lock, flags);
+				list = &zone->free_area[order].free_list[mt];
+				list_for_each_entry(page, list, lru) {
+					pfn = page_to_pfn(page);
+					visit(opaque1, pfn, 1 << order);
+				}
+				spin_unlock_irqrestore(&zone->lock, flags);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [Qemu-devel] [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  6 ++++++
 mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..cd29b9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque1,
+				unsigned int min_order,
+				void (*visit)(void *opaque2,
+					      unsigned long pfn,
+					      unsigned long nr_pages));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..a721a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque1: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @visit: the callback function given by the caller
+ *
+ * The function is used to walk through the free page blocks in the system,
+ * and each free page block is reported to the caller via the @visit callback.
+ * Please note:
+ * 1) The function is used to report hints of free pages, so the caller should
+ * not use those reported pages after the callback returns.
+ * 2) The callback is invoked with the zone->lock being held, so it should not
+ * block and should finish as soon as possible.
+ */
+void walk_free_mem_block(void *opaque1,
+			 unsigned int min_order,
+			 void (*visit)(void *opaque2,
+				       unsigned long pfn,
+				       unsigned long nr_pages))
+{
+	struct zone *zone;
+	struct page *page;
+	struct list_head *list;
+	unsigned int order;
+	enum migratetype mt;
+	unsigned long pfn, flags;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1;
+		     order < MAX_ORDER && order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				spin_lock_irqsave(&zone->lock, flags);
+				list = &zone->free_area[order].free_list[mt];
+				list_for_each_entry(page, list, lru) {
+					pfn = page_to_pfn(page);
+					visit(opaque1, pfn, 1 << order);
+				}
+				spin_unlock_irqrestore(&zone->lock, flags);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-17  3:26 ` Wei Wang
                   ` (9 preceding siblings ...)
  (?)
@ 2017-08-17  3:26 ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  6 ++++++
 mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..cd29b9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque1,
+				unsigned int min_order,
+				void (*visit)(void *opaque2,
+					      unsigned long pfn,
+					      unsigned long nr_pages));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..a721a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque1: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @visit: the callback function given by the caller
+ *
+ * The function is used to walk through the free page blocks in the system,
+ * and each free page block is reported to the caller via the @visit callback.
+ * Please note:
+ * 1) The function is used to report hints of free pages, so the caller should
+ * not use those reported pages after the callback returns.
+ * 2) The callback is invoked with the zone->lock being held, so it should not
+ * block and should finish as soon as possible.
+ */
+void walk_free_mem_block(void *opaque1,
+			 unsigned int min_order,
+			 void (*visit)(void *opaque2,
+				       unsigned long pfn,
+				       unsigned long nr_pages))
+{
+	struct zone *zone;
+	struct page *page;
+	struct list_head *list;
+	unsigned int order;
+	enum migratetype mt;
+	unsigned long pfn, flags;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1;
+		     order < MAX_ORDER && order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				spin_lock_irqsave(&zone->lock, flags);
+				list = &zone->free_area[order].free_list[mt];
+				list_for_each_entry(page, list, lru) {
+					pfn = page_to_pfn(page);
+					visit(opaque1, pfn, 1 << order);
+				}
+				spin_unlock_irqrestore(&zone->lock, flags);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [virtio-dev] [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  6 ++++++
 mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..cd29b9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque1,
+				unsigned int min_order,
+				void (*visit)(void *opaque2,
+					      unsigned long pfn,
+					      unsigned long nr_pages));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..a721a35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque1: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @visit: the callback function given by the caller
+ *
+ * The function is used to walk through the free page blocks in the system,
+ * and each free page block is reported to the caller via the @visit callback.
+ * Please note:
+ * 1) The function is used to report hints of free pages, so the caller should
+ * not use those reported pages after the callback returns.
+ * 2) The callback is invoked with the zone->lock being held, so it should not
+ * block and should finish as soon as possible.
+ */
+void walk_free_mem_block(void *opaque1,
+			 unsigned int min_order,
+			 void (*visit)(void *opaque2,
+				       unsigned long pfn,
+				       unsigned long nr_pages))
+{
+	struct zone *zone;
+	struct page *page;
+	struct list_head *list;
+	unsigned int order;
+	enum migratetype mt;
+	unsigned long pfn, flags;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1;
+		     order < MAX_ORDER && order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				spin_lock_irqsave(&zone->lock, flags);
+				list = &zone->free_area[order].free_list[mt];
+				list_for_each_entry(page, list, lru) {
+					pfn = page_to_pfn(page);
+					visit(opaque1, pfn, 1 << order);
+				}
+				spin_unlock_irqrestore(&zone->lock, flags);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-17  3:26 ` Wei Wang
  (?)
  (?)
@ 2017-08-17  3:26   ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq to report hints of guest free pages to the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 72041b4..e6755bc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
 	struct work_struct update_balloon_size_work;
+	struct work_struct report_free_page_work;
 
 	/* Prevent updating balloon when it is being canceled. */
 	spinlock_t stop_update_lock;
@@ -90,6 +91,13 @@ struct virtio_balloon {
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
+	/*
+	 * Used by the device and driver to signal each other.
+	 * device->driver: start the free page report.
+	 * driver->device: end the free page report.
+	 */
+	__virtio32 report_free_page_signal;
+
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
 };
@@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
 	} while (unlikely(ret == -ENOSPC));
 }
 
+static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	unsigned int len;
+
+	add_one_sg(vq, addr, size);
+	virtqueue_kick(vq);
+	/* Release entries if there are */
+	while (virtqueue_get_buf(vq, &len))
+		;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	send_free_page_sg(vb->free_page_vq, addr, len);
+}
+
+static void report_free_page_completion(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->free_page_vq;
+	struct scatterlist sg;
+	unsigned int len;
+	int ret;
+
+	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
+retry:
+	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	virtqueue_kick(vq);
+	if (unlikely(ret == -ENOSPC)) {
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		goto retry;
+	}
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_completion(vb);
+}
+
+static void free_page_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	queue_work(system_freezable_wq, &vb->report_free_page_work);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = free_page_request;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->free_page_vq = vqs[i];
+		vb->report_free_page_signal = 0;
+		sg_init_one(&sg, &vb->report_free_page_signal,
+			    sizeof(__virtio32));
+		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
+					 GFP_KERNEL) < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
+				 __func__);
+			goto err_find;
+		}
+		virtqueue_kick(vb->free_page_vq);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -792,6 +916,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..8214f84 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq to report hints of guest free pages to the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 72041b4..e6755bc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
 	struct work_struct update_balloon_size_work;
+	struct work_struct report_free_page_work;
 
 	/* Prevent updating balloon when it is being canceled. */
 	spinlock_t stop_update_lock;
@@ -90,6 +91,13 @@ struct virtio_balloon {
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
+	/*
+	 * Used by the device and driver to signal each other.
+	 * device->driver: start the free page report.
+	 * driver->device: end the free page report.
+	 */
+	__virtio32 report_free_page_signal;
+
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
 };
@@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
 	} while (unlikely(ret == -ENOSPC));
 }
 
+static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	unsigned int len;
+
+	add_one_sg(vq, addr, size);
+	virtqueue_kick(vq);
+	/* Release entries if there are */
+	while (virtqueue_get_buf(vq, &len))
+		;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	send_free_page_sg(vb->free_page_vq, addr, len);
+}
+
+static void report_free_page_completion(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->free_page_vq;
+	struct scatterlist sg;
+	unsigned int len;
+	int ret;
+
+	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
+retry:
+	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	virtqueue_kick(vq);
+	if (unlikely(ret == -ENOSPC)) {
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		goto retry;
+	}
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_completion(vb);
+}
+
+static void free_page_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	queue_work(system_freezable_wq, &vb->report_free_page_work);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = free_page_request;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->free_page_vq = vqs[i];
+		vb->report_free_page_signal = 0;
+		sg_init_one(&sg, &vb->report_free_page_signal,
+			    sizeof(__virtio32));
+		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
+					 GFP_KERNEL) < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
+				 __func__);
+			goto err_find;
+		}
+		virtqueue_kick(vb->free_page_vq);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -792,6 +916,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..8214f84 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq to report hints of guest free pages to the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 72041b4..e6755bc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
 	struct work_struct update_balloon_size_work;
+	struct work_struct report_free_page_work;
 
 	/* Prevent updating balloon when it is being canceled. */
 	spinlock_t stop_update_lock;
@@ -90,6 +91,13 @@ struct virtio_balloon {
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
+	/*
+	 * Used by the device and driver to signal each other.
+	 * device->driver: start the free page report.
+	 * driver->device: end the free page report.
+	 */
+	__virtio32 report_free_page_signal;
+
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
 };
@@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
 	} while (unlikely(ret == -ENOSPC));
 }
 
+static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	unsigned int len;
+
+	add_one_sg(vq, addr, size);
+	virtqueue_kick(vq);
+	/* Release entries if there are */
+	while (virtqueue_get_buf(vq, &len))
+		;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	send_free_page_sg(vb->free_page_vq, addr, len);
+}
+
+static void report_free_page_completion(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->free_page_vq;
+	struct scatterlist sg;
+	unsigned int len;
+	int ret;
+
+	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
+retry:
+	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	virtqueue_kick(vq);
+	if (unlikely(ret == -ENOSPC)) {
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		goto retry;
+	}
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_completion(vb);
+}
+
+static void free_page_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	queue_work(system_freezable_wq, &vb->report_free_page_work);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = free_page_request;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->free_page_vq = vqs[i];
+		vb->report_free_page_signal = 0;
+		sg_init_one(&sg, &vb->report_free_page_signal,
+			    sizeof(__virtio32));
+		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
+					 GFP_KERNEL) < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
+				 __func__);
+			goto err_find;
+		}
+		virtqueue_kick(vb->free_page_vq);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -792,6 +916,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..8214f84 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-17  3:26 ` Wei Wang
                   ` (10 preceding siblings ...)
  (?)
@ 2017-08-17  3:26 ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

Add a new vq to report hints of guest free pages to the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 72041b4..e6755bc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
 	struct work_struct update_balloon_size_work;
+	struct work_struct report_free_page_work;
 
 	/* Prevent updating balloon when it is being canceled. */
 	spinlock_t stop_update_lock;
@@ -90,6 +91,13 @@ struct virtio_balloon {
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
+	/*
+	 * Used by the device and driver to signal each other.
+	 * device->driver: start the free page report.
+	 * driver->device: end the free page report.
+	 */
+	__virtio32 report_free_page_signal;
+
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
 };
@@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
 	} while (unlikely(ret == -ENOSPC));
 }
 
+static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	unsigned int len;
+
+	add_one_sg(vq, addr, size);
+	virtqueue_kick(vq);
+	/* Release entries if there are */
+	while (virtqueue_get_buf(vq, &len))
+		;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	send_free_page_sg(vb->free_page_vq, addr, len);
+}
+
+static void report_free_page_completion(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->free_page_vq;
+	struct scatterlist sg;
+	unsigned int len;
+	int ret;
+
+	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
+retry:
+	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	virtqueue_kick(vq);
+	if (unlikely(ret == -ENOSPC)) {
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		goto retry;
+	}
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_completion(vb);
+}
+
+static void free_page_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	queue_work(system_freezable_wq, &vb->report_free_page_work);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = free_page_request;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->free_page_vq = vqs[i];
+		vb->report_free_page_signal = 0;
+		sg_init_one(&sg, &vb->report_free_page_signal,
+			    sizeof(__virtio32));
+		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
+					 GFP_KERNEL) < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
+				 __func__);
+			goto err_find;
+		}
+		virtqueue_kick(vb->free_page_vq);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -792,6 +916,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..8214f84 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [virtio-dev] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-17  3:26   ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq to report hints of guest free pages to the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 72041b4..e6755bc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
 	struct work_struct update_balloon_size_work;
+	struct work_struct report_free_page_work;
 
 	/* Prevent updating balloon when it is being canceled. */
 	spinlock_t stop_update_lock;
@@ -90,6 +91,13 @@ struct virtio_balloon {
 	/* Memory statistics */
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
+	/*
+	 * Used by the device and driver to signal each other.
+	 * device->driver: start the free page report.
+	 * driver->device: end the free page report.
+	 */
+	__virtio32 report_free_page_signal;
+
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
 };
@@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
 	} while (unlikely(ret == -ENOSPC));
 }
 
+static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	unsigned int len;
+
+	add_one_sg(vq, addr, size);
+	virtqueue_kick(vq);
+	/* Release entries if there are */
+	while (virtqueue_get_buf(vq, &len))
+		;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	send_free_page_sg(vb->free_page_vq, addr, len);
+}
+
+static void report_free_page_completion(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->free_page_vq;
+	struct scatterlist sg;
+	unsigned int len;
+	int ret;
+
+	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
+retry:
+	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+	virtqueue_kick(vq);
+	if (unlikely(ret == -ENOSPC)) {
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		goto retry;
+	}
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_completion(vb);
+}
+
+static void free_page_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	queue_work(system_freezable_wq, &vb->report_free_page_work);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		callbacks[i] = free_page_request;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+		vb->free_page_vq = vqs[i];
+		vb->report_free_page_signal = 0;
+		sg_init_one(&sg, &vb->report_free_page_signal,
+			    sizeof(__virtio32));
+		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
+					 GFP_KERNEL) < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
+				 __func__);
+			goto err_find;
+		}
+		virtqueue_kick(vb->free_page_vq);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -792,6 +916,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..8214f84 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-17  3:26   ` Wei Wang
  (?)
  (?)
@ 2017-08-18  2:13     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:13 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.

Please add some text here explaining the report_free_page_signal
thing.


I also really think we need some kind of ID in the
buffer to do a handshake. whenever id changes you
add another outbuf.

> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;

Weird - all I can see is driver writing 0 there, then adding
it as out buf.

> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));

sizeof vb->report_free_page_signal is better.

> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {

what if there's another error?

> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}

what is this trickery doing? needs more comments or
a simplification.



> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);

That's a lot of work here. And system_wq documentation says:
 *
 * system_wq is the one used by schedule[_delayed]_work[_on]().
 * Multi-CPU multi-threaded.  There are users which expect relatively
 * short queue flush time.  Don't queue works which can run for too
 * long.

You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

> +	report_free_page_completion(vb);

So first you get list of pages, then an outbuf telling you
what they are in end of.  I think it's backwards.
Add an outbuf first followed by inbufs that tell you
what they are.


> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  2:13     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:13 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.

Please add some text here explaining the report_free_page_signal
thing.


I also really think we need some kind of ID in the
buffer to do a handshake. whenever id changes you
add another outbuf.

> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;

Weird - all I can see is driver writing 0 there, then adding
it as out buf.

> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));

sizeof vb->report_free_page_signal is better.

> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {

what if there's another error?

> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}

what is this trickery doing? needs more comments or
a simplification.



> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);

That's a lot of work here. And system_wq documentation says:
 *
 * system_wq is the one used by schedule[_delayed]_work[_on]().
 * Multi-CPU multi-threaded.  There are users which expect relatively
 * short queue flush time.  Don't queue works which can run for too
 * long.

You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

> +	report_free_page_completion(vb);

So first you get list of pages, then an outbuf telling you
what they are in end of.  I think it's backwards.
Add an outbuf first followed by inbufs that tell you
what they are.


> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  2:13     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:13 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.

Please add some text here explaining the report_free_page_signal
thing.


I also really think we need some kind of ID in the
buffer to do a handshake. whenever id changes you
add another outbuf.

> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;

Weird - all I can see is driver writing 0 there, then adding
it as out buf.

> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));

sizeof vb->report_free_page_signal is better.

> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {

what if there's another error?

> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}

what is this trickery doing? needs more comments or
a simplification.



> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);

That's a lot of work here. And system_wq documentation says:
 *
 * system_wq is the one used by schedule[_delayed]_work[_on]().
 * Multi-CPU multi-threaded.  There are users which expect relatively
 * short queue flush time.  Don't queue works which can run for too
 * long.

You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

> +	report_free_page_completion(vb);

So first you get list of pages, then an outbuf telling you
what they are in end of.  I think it's backwards.
Add an outbuf first followed by inbufs that tell you
what they are.


> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-17  3:26   ` Wei Wang
                     ` (2 preceding siblings ...)
  (?)
@ 2017-08-18  2:13   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:13 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.

Please add some text here explaining the report_free_page_signal
thing.


I also really think we need some kind of ID in the
buffer to do a handshake. whenever id changes you
add another outbuf.

> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;

Weird - all I can see is driver writing 0 there, then adding
it as out buf.

> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));

sizeof vb->report_free_page_signal is better.

> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {

what if there's another error?

> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}

what is this trickery doing? needs more comments or
a simplification.



> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);

That's a lot of work here. And system_wq documentation says:
 *
 * system_wq is the one used by schedule[_delayed]_work[_on]().
 * Multi-CPU multi-threaded.  There are users which expect relatively
 * short queue flush time.  Don't queue works which can run for too
 * long.

You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

> +	report_free_page_completion(vb);

So first you get list of pages, then an outbuf telling you
what they are in end of.  I think it's backwards.
Add an outbuf first followed by inbufs that tell you
what they are.


> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  2:13     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:13 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.

Please add some text here explaining the report_free_page_signal
thing.


I also really think we need some kind of ID in the
buffer to do a handshake. whenever id changes you
add another outbuf.

> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;

Weird - all I can see is driver writing 0 there, then adding
it as out buf.

> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));

sizeof vb->report_free_page_signal is better.

> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {

what if there's another error?

> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}

what is this trickery doing? needs more comments or
a simplification.



> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);

That's a lot of work here. And system_wq documentation says:
 *
 * system_wq is the one used by schedule[_delayed]_work[_on]().
 * Multi-CPU multi-threaded.  There are users which expect relatively
 * short queue flush time.  Don't queue works which can run for too
 * long.

You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

> +	report_free_page_completion(vb);

So first you get list of pages, then an outbuf telling you
what they are in end of.  I think it's backwards.
Add an outbuf first followed by inbufs that tell you
what they are.


> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-17  3:26   ` Wei Wang
  (?)
  (?)
@ 2017-08-18  2:22     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:54AM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~541ms
> resulting in an improvement of ~87%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 141 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..72041b4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +80,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record ballooned pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, addr, size);
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */

Why send one by one though? Why not batch some s/gs and wait for all
of them to be completed? If memory if fragmented, waiting every time is
worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).

> +	} while (unlikely(ret == -ENOSPC));
> +}
> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> +						page_xb_end, 1);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +					      page_xb_end, 0);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +791,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-18  2:22     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:54AM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~541ms
> resulting in an improvement of ~87%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 141 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..72041b4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +80,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record ballooned pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, addr, size);
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */

Why send one by one though? Why not batch some s/gs and wait for all
of them to be completed? If memory if fragmented, waiting every time is
worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).

> +	} while (unlikely(ret == -ENOSPC));
> +}
> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> +						page_xb_end, 1);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +					      page_xb_end, 0);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +791,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-18  2:22     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:54AM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~541ms
> resulting in an improvement of ~87%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 141 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..72041b4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +80,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record ballooned pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, addr, size);
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */

Why send one by one though? Why not batch some s/gs and wait for all
of them to be completed? If memory if fragmented, waiting every time is
worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).

> +	} while (unlikely(ret == -ENOSPC));
> +}
> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> +						page_xb_end, 1);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +					      page_xb_end, 0);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +791,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-17  3:26   ` Wei Wang
                     ` (3 preceding siblings ...)
  (?)
@ 2017-08-18  2:22   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On Thu, Aug 17, 2017 at 11:26:54AM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~541ms
> resulting in an improvement of ~87%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 141 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..72041b4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +80,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record ballooned pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, addr, size);
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */

Why send one by one though? Why not batch some s/gs and wait for all
of them to be completed? If memory if fragmented, waiting every time is
worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).

> +	} while (unlikely(ret == -ENOSPC));
> +}
> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> +						page_xb_end, 1);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +					      page_xb_end, 0);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +791,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-18  2:22     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:54AM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
> of balloon (i.e. inflated/deflated) pages using scatter-gather lists
> to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the balloon pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> sgs. An sg describes a chunk of guest physically continuous pages.
> With this mechanism, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~541ms
> resulting in an improvement of ~87%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 157 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 141 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f0b3a0b..72041b4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -32,6 +32,7 @@
>  #include <linux/mm.h>
>  #include <linux/mount.h>
>  #include <linux/magic.h>
> +#include <linux/xbitmap.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -79,6 +80,9 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/* The xbitmap used to record ballooned pages */
> +	struct xb page_xb;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -141,13 +145,98 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, addr, size);
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */

Why send one by one though? Why not batch some s/gs and wait for all
of them to be completed? If memory if fragmented, waiting every time is
worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).

> +	} while (unlikely(ret == -ENOSPC));
> +}
> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
> +						page_xb_end, 1);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
> +					      page_xb_end, 0);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		send_balloon_page_sg(vb, vq, sg_addr, sg_len);
> +		xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +251,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +265,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +296,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +311,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +326,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +550,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +574,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +596,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +672,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +791,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-17  3:26   ` Wei Wang
  (?)
  (?)
@ 2017-08-18  2:28     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:28 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;
> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}
> +}

So the annoying thing here is that once this starts going,
it will keep sending free pages from the list even if
host is no longer interested. There should be a way
for host to tell guest "stop" or "start from the beginning".


It's the result of using same vq for guest to host and
host to guest communication, and I think it's not a great idea.
I'd reuse stats vq for host to guest requests maybe.

> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_completion(vb);
> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  2:28     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:28 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;
> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}
> +}

So the annoying thing here is that once this starts going,
it will keep sending free pages from the list even if
host is no longer interested. There should be a way
for host to tell guest "stop" or "start from the beginning".


It's the result of using same vq for guest to host and
host to guest communication, and I think it's not a great idea.
I'd reuse stats vq for host to guest requests maybe.

> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_completion(vb);
> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  2:28     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:28 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;
> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}
> +}

So the annoying thing here is that once this starts going,
it will keep sending free pages from the list even if
host is no longer interested. There should be a way
for host to tell guest "stop" or "start from the beginning".


It's the result of using same vq for guest to host and
host to guest communication, and I think it's not a great idea.
I'd reuse stats vq for host to guest requests maybe.

> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_completion(vb);
> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-17  3:26   ` Wei Wang
                     ` (5 preceding siblings ...)
  (?)
@ 2017-08-18  2:28   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:28 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;
> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}
> +}

So the annoying thing here is that once this starts going,
it will keep sending free pages from the list even if
host is no longer interested. There should be a way
for host to tell guest "stop" or "start from the beginning".


It's the result of using same vq for guest to host and
host to guest communication, and I think it's not a great idea.
I'd reuse stats vq for host to guest requests maybe.

> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_completion(vb);
> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  2:28     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18  2:28 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> Add a new vq to report hints of guest free pages to the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |   1 +
>  2 files changed, 147 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 72041b4..e6755bc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
>  	struct work_struct update_balloon_size_work;
> +	struct work_struct report_free_page_work;
>  
>  	/* Prevent updating balloon when it is being canceled. */
>  	spinlock_t stop_update_lock;
> @@ -90,6 +91,13 @@ struct virtio_balloon {
>  	/* Memory statistics */
>  	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>  
> +	/*
> +	 * Used by the device and driver to signal each other.
> +	 * device->driver: start the free page report.
> +	 * driver->device: end the free page report.
> +	 */
> +	__virtio32 report_free_page_signal;
> +
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
>  };
> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>  	} while (unlikely(ret == -ENOSPC));
>  }
>  
> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	unsigned int len;
> +
> +	add_one_sg(vq, addr, size);
> +	virtqueue_kick(vq);
> +	/* Release entries if there are */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	send_free_page_sg(vb->free_page_vq, addr, len);
> +}
> +
> +static void report_free_page_completion(struct virtio_balloon *vb)
> +{
> +	struct virtqueue *vq = vb->free_page_vq;
> +	struct scatterlist sg;
> +	unsigned int len;
> +	int ret;
> +
> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> +retry:
> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	virtqueue_kick(vq);
> +	if (unlikely(ret == -ENOSPC)) {
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		goto retry;
> +	}
> +}

So the annoying thing here is that once this starts going,
it will keep sending free pages from the list even if
host is no longer interested. There should be a way
for host to tell guest "stop" or "start from the beginning".


It's the result of using same vq for guest to host and
host to guest communication, and I think it's not a great idea.
I'd reuse stats vq for host to guest requests maybe.

> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_completion(vb);
> +}
> +
> +static void free_page_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +
> +	queue_work(system_freezable_wq, &vb->report_free_page_work);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		callbacks[i] = free_page_request;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
> +		vb->free_page_vq = vqs[i];
> +		vb->report_free_page_signal = 0;
> +		sg_init_one(&sg, &vb->report_free_page_signal,
> +			    sizeof(__virtio32));
> +		if (virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb,
> +					 GFP_KERNEL) < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add signal buf failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
> +		virtqueue_kick(vb->free_page_vq);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -675,6 +795,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -739,6 +862,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -792,6 +916,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_FREE_PAGE_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..8214f84 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	4 /* Virtqueue to report free pages */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-18  2:22     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-08-18  7:39       ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  7:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */
> Why send one by one though? Why not batch some s/gs and wait for all
> of them to be completed? If memory if fragmented, waiting every time is
> worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
>

OK, I'll do batching in some fashion.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-18  7:39       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  7:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */
> Why send one by one though? Why not batch some s/gs and wait for all
> of them to be completed? If memory if fragmented, waiting every time is
> worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
>

OK, I'll do batching in some fashion.


Best,
Wei



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-18  7:39       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  7:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */
> Why send one by one though? Why not batch some s/gs and wait for all
> of them to be completed? If memory if fragmented, waiting every time is
> worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
>

OK, I'll do batching in some fashion.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-18  2:22     ` Michael S. Tsirkin
                       ` (2 preceding siblings ...)
  (?)
@ 2017-08-18  7:39     ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  7:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */
> Why send one by one though? Why not batch some s/gs and wait for all
> of them to be completed? If memory if fragmented, waiting every time is
> worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
>

OK, I'll do batching in some fashion.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-18  7:39       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  7:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> +static void send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size)
> +{
> +	unsigned int len;
> +	int ret;
> +
> +	do {
> +		ret = add_one_sg(vq, addr, size);
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		/*
> +		 * It is uncommon to see the vq is full, because the sg is sent
> +		 * one by one and the device is able to handle it in time. But
> +		 * if that happens, we go back to retry after an entry gets
> +		 * released.
> +		 */
> Why send one by one though? Why not batch some s/gs and wait for all
> of them to be completed? If memory if fragmented, waiting every time is
> worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
>

OK, I'll do batching in some fashion.


Best,
Wei




---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  2:28     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-08-18  8:36       ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>   include/uapi/linux/virtio_balloon.h |   1 +
>>   2 files changed, 147 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 72041b4..e6755bc 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>   
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>   
>>   	/* The balloon servicing is delegated to a freezable workqueue. */
>>   	struct work_struct update_balloon_stats_work;
>>   	struct work_struct update_balloon_size_work;
>> +	struct work_struct report_free_page_work;
>>   
>>   	/* Prevent updating balloon when it is being canceled. */
>>   	spinlock_t stop_update_lock;
>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>   	/* Memory statistics */
>>   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>   
>> +	/*
>> +	 * Used by the device and driver to signal each other.
>> +	 * device->driver: start the free page report.
>> +	 * driver->device: end the free page report.
>> +	 */
>> +	__virtio32 report_free_page_signal;
>> +
>>   	/* To register callback in oom notifier call chain */
>>   	struct notifier_block nb;
>>   };
>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>   	} while (unlikely(ret == -ENOSPC));
>>   }
>>   
>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>> +{
>> +	unsigned int len;
>> +
>> +	add_one_sg(vq, addr, size);
>> +	virtqueue_kick(vq);
>> +	/* Release entries if there are */
>> +	while (virtqueue_get_buf(vq, &len))
>> +		;
>> +}
>> +
>>   /*
>>    * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>   		queue_work(system_freezable_wq, work);
>>   }
>>   
>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>> +					   unsigned long nr_pages)
>> +{
>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>> +
>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>> +}
>> +
>> +static void report_free_page_completion(struct virtio_balloon *vb)
>> +{
>> +	struct virtqueue *vq = vb->free_page_vq;
>> +	struct scatterlist sg;
>> +	unsigned int len;
>> +	int ret;
>> +
>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
>> +}
> So the annoying thing here is that once this starts going,
> it will keep sending free pages from the list even if
> host is no longer interested. There should be a way
> for host to tell guest "stop" or "start from the beginning".

This can be achieved via two output signal buf here:
signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END

The device holds both, and can put one of them to the vq and notify.



>
> It's the result of using same vq for guest to host and
> host to guest communication, and I think it's not a great idea.
> I'd reuse stats vq for host to guest requests maybe.
>


As we discussed before, we can't have a vq interleave the report of 
stats and free pages.
The vq will be locked when one command is in use. So, when live 
migration starts, the
periodically reported stats will be delayed. Would this be OK? Or would 
you like to have
one host to guest vq, and multiple host to guest vqs? That is,

- host to guest:
CMD_VQ

- guest to host:
STATS_REPORT_VQ
FREE_PAGE_VQ


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  8:36       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>   include/uapi/linux/virtio_balloon.h |   1 +
>>   2 files changed, 147 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 72041b4..e6755bc 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>   
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>   
>>   	/* The balloon servicing is delegated to a freezable workqueue. */
>>   	struct work_struct update_balloon_stats_work;
>>   	struct work_struct update_balloon_size_work;
>> +	struct work_struct report_free_page_work;
>>   
>>   	/* Prevent updating balloon when it is being canceled. */
>>   	spinlock_t stop_update_lock;
>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>   	/* Memory statistics */
>>   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>   
>> +	/*
>> +	 * Used by the device and driver to signal each other.
>> +	 * device->driver: start the free page report.
>> +	 * driver->device: end the free page report.
>> +	 */
>> +	__virtio32 report_free_page_signal;
>> +
>>   	/* To register callback in oom notifier call chain */
>>   	struct notifier_block nb;
>>   };
>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>   	} while (unlikely(ret == -ENOSPC));
>>   }
>>   
>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>> +{
>> +	unsigned int len;
>> +
>> +	add_one_sg(vq, addr, size);
>> +	virtqueue_kick(vq);
>> +	/* Release entries if there are */
>> +	while (virtqueue_get_buf(vq, &len))
>> +		;
>> +}
>> +
>>   /*
>>    * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>   		queue_work(system_freezable_wq, work);
>>   }
>>   
>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>> +					   unsigned long nr_pages)
>> +{
>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>> +
>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>> +}
>> +
>> +static void report_free_page_completion(struct virtio_balloon *vb)
>> +{
>> +	struct virtqueue *vq = vb->free_page_vq;
>> +	struct scatterlist sg;
>> +	unsigned int len;
>> +	int ret;
>> +
>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
>> +}
> So the annoying thing here is that once this starts going,
> it will keep sending free pages from the list even if
> host is no longer interested. There should be a way
> for host to tell guest "stop" or "start from the beginning".

This can be achieved via two output signal buf here:
signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END

The device holds both, and can put one of them to the vq and notify.



>
> It's the result of using same vq for guest to host and
> host to guest communication, and I think it's not a great idea.
> I'd reuse stats vq for host to guest requests maybe.
>


As we discussed before, we can't have a vq interleave the report of 
stats and free pages.
The vq will be locked when one command is in use. So, when live 
migration starts, the
periodically reported stats will be delayed. Would this be OK? Or would 
you like to have
one host to guest vq, and multiple host to guest vqs? That is,

- host to guest:
CMD_VQ

- guest to host:
STATS_REPORT_VQ
FREE_PAGE_VQ


Best,
Wei



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  8:36       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>   include/uapi/linux/virtio_balloon.h |   1 +
>>   2 files changed, 147 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 72041b4..e6755bc 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>   
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>   
>>   	/* The balloon servicing is delegated to a freezable workqueue. */
>>   	struct work_struct update_balloon_stats_work;
>>   	struct work_struct update_balloon_size_work;
>> +	struct work_struct report_free_page_work;
>>   
>>   	/* Prevent updating balloon when it is being canceled. */
>>   	spinlock_t stop_update_lock;
>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>   	/* Memory statistics */
>>   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>   
>> +	/*
>> +	 * Used by the device and driver to signal each other.
>> +	 * device->driver: start the free page report.
>> +	 * driver->device: end the free page report.
>> +	 */
>> +	__virtio32 report_free_page_signal;
>> +
>>   	/* To register callback in oom notifier call chain */
>>   	struct notifier_block nb;
>>   };
>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>   	} while (unlikely(ret == -ENOSPC));
>>   }
>>   
>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>> +{
>> +	unsigned int len;
>> +
>> +	add_one_sg(vq, addr, size);
>> +	virtqueue_kick(vq);
>> +	/* Release entries if there are */
>> +	while (virtqueue_get_buf(vq, &len))
>> +		;
>> +}
>> +
>>   /*
>>    * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>   		queue_work(system_freezable_wq, work);
>>   }
>>   
>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>> +					   unsigned long nr_pages)
>> +{
>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>> +
>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>> +}
>> +
>> +static void report_free_page_completion(struct virtio_balloon *vb)
>> +{
>> +	struct virtqueue *vq = vb->free_page_vq;
>> +	struct scatterlist sg;
>> +	unsigned int len;
>> +	int ret;
>> +
>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
>> +}
> So the annoying thing here is that once this starts going,
> it will keep sending free pages from the list even if
> host is no longer interested. There should be a way
> for host to tell guest "stop" or "start from the beginning".

This can be achieved via two output signal buf here:
signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END

The device holds both, and can put one of them to the vq and notify.



>
> It's the result of using same vq for guest to host and
> host to guest communication, and I think it's not a great idea.
> I'd reuse stats vq for host to guest requests maybe.
>


As we discussed before, we can't have a vq interleave the report of 
stats and free pages.
The vq will be locked when one command is in use. So, when live 
migration starts, the
periodically reported stats will be delayed. Would this be OK? Or would 
you like to have
one host to guest vq, and multiple host to guest vqs? That is,

- host to guest:
CMD_VQ

- guest to host:
STATS_REPORT_VQ
FREE_PAGE_VQ


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  2:28     ` Michael S. Tsirkin
                       ` (2 preceding siblings ...)
  (?)
@ 2017-08-18  8:36     ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>   include/uapi/linux/virtio_balloon.h |   1 +
>>   2 files changed, 147 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 72041b4..e6755bc 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>   
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>   
>>   	/* The balloon servicing is delegated to a freezable workqueue. */
>>   	struct work_struct update_balloon_stats_work;
>>   	struct work_struct update_balloon_size_work;
>> +	struct work_struct report_free_page_work;
>>   
>>   	/* Prevent updating balloon when it is being canceled. */
>>   	spinlock_t stop_update_lock;
>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>   	/* Memory statistics */
>>   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>   
>> +	/*
>> +	 * Used by the device and driver to signal each other.
>> +	 * device->driver: start the free page report.
>> +	 * driver->device: end the free page report.
>> +	 */
>> +	__virtio32 report_free_page_signal;
>> +
>>   	/* To register callback in oom notifier call chain */
>>   	struct notifier_block nb;
>>   };
>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>   	} while (unlikely(ret == -ENOSPC));
>>   }
>>   
>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>> +{
>> +	unsigned int len;
>> +
>> +	add_one_sg(vq, addr, size);
>> +	virtqueue_kick(vq);
>> +	/* Release entries if there are */
>> +	while (virtqueue_get_buf(vq, &len))
>> +		;
>> +}
>> +
>>   /*
>>    * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>   		queue_work(system_freezable_wq, work);
>>   }
>>   
>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>> +					   unsigned long nr_pages)
>> +{
>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>> +
>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>> +}
>> +
>> +static void report_free_page_completion(struct virtio_balloon *vb)
>> +{
>> +	struct virtqueue *vq = vb->free_page_vq;
>> +	struct scatterlist sg;
>> +	unsigned int len;
>> +	int ret;
>> +
>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
>> +}
> So the annoying thing here is that once this starts going,
> it will keep sending free pages from the list even if
> host is no longer interested. There should be a way
> for host to tell guest "stop" or "start from the beginning".

This can be achieved via two output signal buf here:
signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END

The device holds both, and can put one of them to the vq and notify.



>
> It's the result of using same vq for guest to host and
> host to guest communication, and I think it's not a great idea.
> I'd reuse stats vq for host to guest requests maybe.
>


As we discussed before, we can't have a vq interleave the report of 
stats and free pages.
The vq will be locked when one command is in use. So, when live 
migration starts, the
periodically reported stats will be delayed. Would this be OK? Or would 
you like to have
one host to guest vq, and multiple host to guest vqs? That is,

- host to guest:
CMD_VQ

- guest to host:
STATS_REPORT_VQ
FREE_PAGE_VQ


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  8:36       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>   include/uapi/linux/virtio_balloon.h |   1 +
>>   2 files changed, 147 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 72041b4..e6755bc 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>   
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>   
>>   	/* The balloon servicing is delegated to a freezable workqueue. */
>>   	struct work_struct update_balloon_stats_work;
>>   	struct work_struct update_balloon_size_work;
>> +	struct work_struct report_free_page_work;
>>   
>>   	/* Prevent updating balloon when it is being canceled. */
>>   	spinlock_t stop_update_lock;
>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>   	/* Memory statistics */
>>   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>   
>> +	/*
>> +	 * Used by the device and driver to signal each other.
>> +	 * device->driver: start the free page report.
>> +	 * driver->device: end the free page report.
>> +	 */
>> +	__virtio32 report_free_page_signal;
>> +
>>   	/* To register callback in oom notifier call chain */
>>   	struct notifier_block nb;
>>   };
>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>   	} while (unlikely(ret == -ENOSPC));
>>   }
>>   
>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>> +{
>> +	unsigned int len;
>> +
>> +	add_one_sg(vq, addr, size);
>> +	virtqueue_kick(vq);
>> +	/* Release entries if there are */
>> +	while (virtqueue_get_buf(vq, &len))
>> +		;
>> +}
>> +
>>   /*
>>    * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>   		queue_work(system_freezable_wq, work);
>>   }
>>   
>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>> +					   unsigned long nr_pages)
>> +{
>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>> +
>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>> +}
>> +
>> +static void report_free_page_completion(struct virtio_balloon *vb)
>> +{
>> +	struct virtqueue *vq = vb->free_page_vq;
>> +	struct scatterlist sg;
>> +	unsigned int len;
>> +	int ret;
>> +
>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
>> +}
> So the annoying thing here is that once this starts going,
> it will keep sending free pages from the list even if
> host is no longer interested. There should be a way
> for host to tell guest "stop" or "start from the beginning".

This can be achieved via two output signal buf here:
signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END

The device holds both, and can put one of them to the vq and notify.



>
> It's the result of using same vq for guest to host and
> host to guest communication, and I think it's not a great idea.
> I'd reuse stats vq for host to guest requests maybe.
>


As we discussed before, we can't have a vq interleave the report of 
stats and free pages.
The vq will be locked when one command is in use. So, when live 
migration starts, the
periodically reported stats will be delayed. Would this be OK? Or would 
you like to have
one host to guest vq, and multiple host to guest vqs? That is,

- host to guest:
CMD_VQ

- guest to host:
STATS_REPORT_VQ
FREE_PAGE_VQ


Best,
Wei




---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  2:13     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-08-18  8:41       ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
> Please add some text here explaining the report_free_page_signal
> thing.
>
>
> I also really think we need some kind of ID in the
> buffer to do a handshake. whenever id changes you
> add another outbuf.

Please let me introduce the current design first:
1) device put the signal buf to the vq and notify the driver (we need
a buffer because currently the device can't notify when the vq is empty);

2) the driver starts the report of free page blocks via inbuf;

3) the driver adds an the signal buf via outbuf to tell the device all 
are reported.


Could you please elaborate more on the usage of ID?

>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
> what if there's another error?

Another error is -EIO, how about disabling the free page report feature?
(I also saw it isn't handled in many other virtio devices e.g. virtio-net)

>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
> what is this trickery doing? needs more comments or
> a simplification.

Just this:
if the vq is full, blocking wait till an entry gets released, then 
retry. This is the
final one, which puts the signal buf to the vq to signify the end of the 
report and
the mm lock is not held here, so it is fine to block.


>
>
>> +}
>> +
>> +static void report_free_page(struct work_struct *work)
>> +{
>> +	struct virtio_balloon *vb;
>> +
>> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
>> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> That's a lot of work here. And system_wq documentation says:
>   *
>   * system_wq is the one used by schedule[_delayed]_work[_on]().
>   * Multi-CPU multi-threaded.  There are users which expect relatively
>   * short queue flush time.  Don't queue works which can run for too
>   * long.
>
> You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

Thanks for the reminder. If not creating a new wq, how about 
system_unbound_wq?
The first round of live migration needs the free pages, in that way we 
can have the
pages reported to the hypervisor quicker.

>
>> +	report_free_page_completion(vb);
> So first you get list of pages, then an outbuf telling you
> what they are in end of.  I think it's backwards.
> Add an outbuf first followed by inbufs that tell you
> what they are.


If we have the signal filled with those flags like
VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
Probably not necessary to have an inbuf followed by an outbuf, right?


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  8:41       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
> Please add some text here explaining the report_free_page_signal
> thing.
>
>
> I also really think we need some kind of ID in the
> buffer to do a handshake. whenever id changes you
> add another outbuf.

Please let me introduce the current design first:
1) device put the signal buf to the vq and notify the driver (we need
a buffer because currently the device can't notify when the vq is empty);

2) the driver starts the report of free page blocks via inbuf;

3) the driver adds an the signal buf via outbuf to tell the device all 
are reported.


Could you please elaborate more on the usage of ID?

>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
> what if there's another error?

Another error is -EIO, how about disabling the free page report feature?
(I also saw it isn't handled in many other virtio devices e.g. virtio-net)

>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
> what is this trickery doing? needs more comments or
> a simplification.

Just this:
if the vq is full, blocking wait till an entry gets released, then 
retry. This is the
final one, which puts the signal buf to the vq to signify the end of the 
report and
the mm lock is not held here, so it is fine to block.


>
>
>> +}
>> +
>> +static void report_free_page(struct work_struct *work)
>> +{
>> +	struct virtio_balloon *vb;
>> +
>> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
>> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> That's a lot of work here. And system_wq documentation says:
>   *
>   * system_wq is the one used by schedule[_delayed]_work[_on]().
>   * Multi-CPU multi-threaded.  There are users which expect relatively
>   * short queue flush time.  Don't queue works which can run for too
>   * long.
>
> You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

Thanks for the reminder. If not creating a new wq, how about 
system_unbound_wq?
The first round of live migration needs the free pages, in that way we 
can have the
pages reported to the hypervisor quicker.

>
>> +	report_free_page_completion(vb);
> So first you get list of pages, then an outbuf telling you
> what they are in end of.  I think it's backwards.
> Add an outbuf first followed by inbufs that tell you
> what they are.


If we have the signal filled with those flags like
VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
Probably not necessary to have an inbuf followed by an outbuf, right?


Best,
Wei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  8:41       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
> Please add some text here explaining the report_free_page_signal
> thing.
>
>
> I also really think we need some kind of ID in the
> buffer to do a handshake. whenever id changes you
> add another outbuf.

Please let me introduce the current design first:
1) device put the signal buf to the vq and notify the driver (we need
a buffer because currently the device can't notify when the vq is empty);

2) the driver starts the report of free page blocks via inbuf;

3) the driver adds an the signal buf via outbuf to tell the device all 
are reported.


Could you please elaborate more on the usage of ID?

>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
> what if there's another error?

Another error is -EIO, how about disabling the free page report feature?
(I also saw it isn't handled in many other virtio devices e.g. virtio-net)

>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
> what is this trickery doing? needs more comments or
> a simplification.

Just this:
if the vq is full, blocking wait till an entry gets released, then 
retry. This is the
final one, which puts the signal buf to the vq to signify the end of the 
report and
the mm lock is not held here, so it is fine to block.


>
>
>> +}
>> +
>> +static void report_free_page(struct work_struct *work)
>> +{
>> +	struct virtio_balloon *vb;
>> +
>> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
>> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> That's a lot of work here. And system_wq documentation says:
>   *
>   * system_wq is the one used by schedule[_delayed]_work[_on]().
>   * Multi-CPU multi-threaded.  There are users which expect relatively
>   * short queue flush time.  Don't queue works which can run for too
>   * long.
>
> You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

Thanks for the reminder. If not creating a new wq, how about 
system_unbound_wq?
The first round of live migration needs the free pages, in that way we 
can have the
pages reported to the hypervisor quicker.

>
>> +	report_free_page_completion(vb);
> So first you get list of pages, then an outbuf telling you
> what they are in end of.  I think it's backwards.
> Add an outbuf first followed by inbufs that tell you
> what they are.


If we have the signal filled with those flags like
VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
Probably not necessary to have an inbuf followed by an outbuf, right?


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  2:13     ` Michael S. Tsirkin
                       ` (2 preceding siblings ...)
  (?)
@ 2017-08-18  8:41     ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
> Please add some text here explaining the report_free_page_signal
> thing.
>
>
> I also really think we need some kind of ID in the
> buffer to do a handshake. whenever id changes you
> add another outbuf.

Please let me introduce the current design first:
1) device put the signal buf to the vq and notify the driver (we need
a buffer because currently the device can't notify when the vq is empty);

2) the driver starts the report of free page blocks via inbuf;

3) the driver adds an the signal buf via outbuf to tell the device all 
are reported.


Could you please elaborate more on the usage of ID?

>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
> what if there's another error?

Another error is -EIO, how about disabling the free page report feature?
(I also saw it isn't handled in many other virtio devices e.g. virtio-net)

>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
> what is this trickery doing? needs more comments or
> a simplification.

Just this:
if the vq is full, blocking wait till an entry gets released, then 
retry. This is the
final one, which puts the signal buf to the vq to signify the end of the 
report and
the mm lock is not held here, so it is fine to block.


>
>
>> +}
>> +
>> +static void report_free_page(struct work_struct *work)
>> +{
>> +	struct virtio_balloon *vb;
>> +
>> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
>> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> That's a lot of work here. And system_wq documentation says:
>   *
>   * system_wq is the one used by schedule[_delayed]_work[_on]().
>   * Multi-CPU multi-threaded.  There are users which expect relatively
>   * short queue flush time.  Don't queue works which can run for too
>   * long.
>
> You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

Thanks for the reminder. If not creating a new wq, how about 
system_unbound_wq?
The first round of live migration needs the free pages, in that way we 
can have the
pages reported to the hypervisor quicker.

>
>> +	report_free_page_completion(vb);
> So first you get list of pages, then an outbuf telling you
> what they are in end of.  I think it's backwards.
> Add an outbuf first followed by inbufs that tell you
> what they are.


If we have the signal filled with those flags like
VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
Probably not necessary to have an inbuf followed by an outbuf, right?


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18  8:41       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-18  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>> Add a new vq to report hints of guest free pages to the host.
> Please add some text here explaining the report_free_page_signal
> thing.
>
>
> I also really think we need some kind of ID in the
> buffer to do a handshake. whenever id changes you
> add another outbuf.

Please let me introduce the current design first:
1) device put the signal buf to the vq and notify the driver (we need
a buffer because currently the device can't notify when the vq is empty);

2) the driver starts the report of free page blocks via inbuf;

3) the driver adds an the signal buf via outbuf to tell the device all 
are reported.


Could you please elaborate more on the usage of ID?

>> +retry:
>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	virtqueue_kick(vq);
>> +	if (unlikely(ret == -ENOSPC)) {
> what if there's another error?

Another error is -EIO, how about disabling the free page report feature?
(I also saw it isn't handled in many other virtio devices e.g. virtio-net)

>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		goto retry;
>> +	}
> what is this trickery doing? needs more comments or
> a simplification.

Just this:
if the vq is full, blocking wait till an entry gets released, then 
retry. This is the
final one, which puts the signal buf to the vq to signify the end of the 
report and
the mm lock is not held here, so it is fine to block.


>
>
>> +}
>> +
>> +static void report_free_page(struct work_struct *work)
>> +{
>> +	struct virtio_balloon *vb;
>> +
>> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
>> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> That's a lot of work here. And system_wq documentation says:
>   *
>   * system_wq is the one used by schedule[_delayed]_work[_on]().
>   * Multi-CPU multi-threaded.  There are users which expect relatively
>   * short queue flush time.  Don't queue works which can run for too
>   * long.
>
> You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.

Thanks for the reminder. If not creating a new wq, how about 
system_unbound_wq?
The first round of live migration needs the free pages, in that way we 
can have the
pages reported to the hypervisor quicker.

>
>> +	report_free_page_completion(vb);
> So first you get list of pages, then an outbuf telling you
> what they are in end of.  I think it's backwards.
> Add an outbuf first followed by inbufs that tell you
> what they are.


If we have the signal filled with those flags like
VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
Probably not necessary to have an inbuf followed by an outbuf, right?


Best,
Wei

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-17  3:26   ` Wei Wang
  (?)
@ 2017-08-18 13:46     ` Michal Hocko
  -1 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-18 13:46 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu 17-08-17 11:26:55, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.

> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"

> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,

make the order int and...
> +			 void (*visit)(void *opaque2,
> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {

you will not need the underflow check which is just ugly

> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);
> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);

				cond_resched();
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

Other than that this looks _much_ more reasonable than previous
versions.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-18 13:46     ` Michal Hocko
  0 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-18 13:46 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu 17-08-17 11:26:55, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.

> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"

> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,

make the order int and...
> +			 void (*visit)(void *opaque2,
> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {

you will not need the underflow check which is just ugly

> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);
> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);

				cond_resched();
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

Other than that this looks _much_ more reasonable than previous
versions.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-18 13:46     ` Michal Hocko
  0 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-18 13:46 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu 17-08-17 11:26:55, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.

> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"

> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,

make the order int and...
> +			 void (*visit)(void *opaque2,
> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {

you will not need the underflow check which is just ugly

> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);
> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);

				cond_resched();
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

Other than that this looks _much_ more reasonable than previous
versions.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-17  3:26   ` Wei Wang
                     ` (3 preceding siblings ...)
  (?)
@ 2017-08-18 13:46   ` Michal Hocko
  -1 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-18 13:46 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mst, qemu-devel, amit.shah,
	liliang.opensource, mawilcox, linux-kernel, willy,
	virtualization, linux-mm, yang.zhang.wz, quan.xu, cornelia.huck,
	pbonzini, akpm, mgorman

On Thu 17-08-17 11:26:55, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.

This could see more details to be honest. Especially the usecase you are
going to use this for. This will help us to understand the motivation
in future when the current user might be gone a new ones largely diverge
into a different usage. This wouldn't be the first time I have seen
something like that.

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller

The original suggestion for using visit was motivated by a visit design
pattern but I can see how this can be confusing. Maybe a more explicit
name wold be better. What about report_free_range.

> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.

I think that the explicit note about zone->lock is not really need. This
can change in future and I would even bet that somebody might rely on
the lock being held for some purpose and silently get broken with the
change. Instead I would much rather see something like the following:
"
Please note that there are no locking guarantees for the callback and
that the reported pfn range might be freed or disappear after the
callback returns so the caller has to be very careful how it is used.

The callback itself must not sleep or perform any operations which would
require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
or via any lock dependency. It is generally advisable to implement
the callback as simple as possible and defer any heavy lifting to a
different context.

There is no guarantee that each free range will be reported only once
during one walk_free_mem_block invocation.

pfn_to_page on the given range is strongly discouraged and if there is
an absolute need for that make sure to contact MM people to discuss
potential problems.

The function itself might sleep so it cannot be called from atomic
contexts.

In general low orders tend to be very volatile and so it makes more
sense to query larger ones for various optimizations which like
ballooning etc... This will reduce the overhead as well.
"

> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,

make the order int and...
> +			 void (*visit)(void *opaque2,
> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {

you will not need the underflow check which is just ugly

> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);
> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);

				cond_resched();
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

Other than that this looks _much_ more reasonable than previous
versions.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-17  3:26   ` Wei Wang
  (?)
  (?)
@ 2017-08-18 17:23     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 17:23 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller
> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.
> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,
> +			 void (*visit)(void *opaque2,

You can just avoid opaque2 completely I think, then opaque1 can
be renamed opaque.

> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);

My only concern here is inability of callback to
1. break out of list
2. remove page from the list

So I would make the callback bool, and I would use
list_for_each_entry_safe.


> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-18 17:23     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 17:23 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller
> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.
> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,
> +			 void (*visit)(void *opaque2,

You can just avoid opaque2 completely I think, then opaque1 can
be renamed opaque.

> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);

My only concern here is inability of callback to
1. break out of list
2. remove page from the list

So I would make the callback bool, and I would use
list_for_each_entry_safe.


> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-18 17:23     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 17:23 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller
> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.
> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,
> +			 void (*visit)(void *opaque2,

You can just avoid opaque2 completely I think, then opaque1 can
be renamed opaque.

> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);

My only concern here is inability of callback to
1. break out of list
2. remove page from the list

So I would make the callback bool, and I would use
list_for_each_entry_safe.


> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-17  3:26   ` Wei Wang
                     ` (5 preceding siblings ...)
  (?)
@ 2017-08-18 17:23   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 17:23 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller
> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.
> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,
> +			 void (*visit)(void *opaque2,

You can just avoid opaque2 completely I think, then opaque1 can
be renamed opaque.

> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);

My only concern here is inability of callback to
1. break out of list
2. remove page from the list

So I would make the callback bool, and I would use
list_for_each_entry_safe.


> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-18 17:23     ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 17:23 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..cd29b9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque1,
> +				unsigned int min_order,
> +				void (*visit)(void *opaque2,
> +					      unsigned long pfn,
> +					      unsigned long nr_pages));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..a721a35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque1: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @visit: the callback function given by the caller
> + *
> + * The function is used to walk through the free page blocks in the system,
> + * and each free page block is reported to the caller via the @visit callback.
> + * Please note:
> + * 1) The function is used to report hints of free pages, so the caller should
> + * not use those reported pages after the callback returns.
> + * 2) The callback is invoked with the zone->lock being held, so it should not
> + * block and should finish as soon as possible.
> + */
> +void walk_free_mem_block(void *opaque1,
> +			 unsigned int min_order,
> +			 void (*visit)(void *opaque2,

You can just avoid opaque2 completely I think, then opaque1 can
be renamed opaque.

> +				       unsigned long pfn,
> +				       unsigned long nr_pages))
> +{
> +	struct zone *zone;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned int order;
> +	enum migratetype mt;
> +	unsigned long pfn, flags;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1;
> +		     order < MAX_ORDER && order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				spin_lock_irqsave(&zone->lock, flags);
> +				list = &zone->free_area[order].free_list[mt];
> +				list_for_each_entry(page, list, lru) {
> +					pfn = page_to_pfn(page);
> +					visit(opaque1, pfn, 1 << order);

My only concern here is inability of callback to
1. break out of list
2. remove page from the list

So I would make the callback bool, and I would use
list_for_each_entry_safe.


> +				}
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  8:36       ` Wei Wang
  (?)
  (?)
@ 2017-08-18 18:10         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:10 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
> > >   include/uapi/linux/virtio_balloon.h |   1 +
> > >   2 files changed, 147 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 72041b4..e6755bc 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
> > >   struct virtio_balloon {
> > >   	struct virtio_device *vdev;
> > > -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > >   	/* The balloon servicing is delegated to a freezable workqueue. */
> > >   	struct work_struct update_balloon_stats_work;
> > >   	struct work_struct update_balloon_size_work;
> > > +	struct work_struct report_free_page_work;
> > >   	/* Prevent updating balloon when it is being canceled. */
> > >   	spinlock_t stop_update_lock;
> > > @@ -90,6 +91,13 @@ struct virtio_balloon {
> > >   	/* Memory statistics */
> > >   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > +	/*
> > > +	 * Used by the device and driver to signal each other.
> > > +	 * device->driver: start the free page report.
> > > +	 * driver->device: end the free page report.
> > > +	 */
> > > +	__virtio32 report_free_page_signal;
> > > +
> > >   	/* To register callback in oom notifier call chain */
> > >   	struct notifier_block nb;
> > >   };
> > > @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
> > >   	} while (unlikely(ret == -ENOSPC));
> > >   }
> > > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > +{
> > > +	unsigned int len;
> > > +
> > > +	add_one_sg(vq, addr, size);
> > > +	virtqueue_kick(vq);
> > > +	/* Release entries if there are */
> > > +	while (virtqueue_get_buf(vq, &len))
> > > +		;
> > > +}
> > > +
> > >   /*
> > >    * Send balloon pages in sgs to host. The balloon pages are recorded in the
> > >    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > > @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
> > >   		queue_work(system_freezable_wq, work);
> > >   }
> > > +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> > > +					   unsigned long nr_pages)
> > > +{
> > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > +	void *addr = (void *)pfn_to_kaddr(pfn);
> > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > +
> > > +	send_free_page_sg(vb->free_page_vq, addr, len);
> > > +}
> > > +
> > > +static void report_free_page_completion(struct virtio_balloon *vb)
> > > +{
> > > +	struct virtqueue *vq = vb->free_page_vq;
> > > +	struct scatterlist sg;
> > > +	unsigned int len;
> > > +	int ret;
> > > +
> > > +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > > +}
> > So the annoying thing here is that once this starts going,
> > it will keep sending free pages from the list even if
> > host is no longer interested. There should be a way
> > for host to tell guest "stop" or "start from the beginning".
> 
> This can be achieved via two output signal buf here:
> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
> 
> The device holds both, and can put one of them to the vq and notify.

Do you mean device writes start and end in the buf? then it's an inbuf
not an outbuf.

> 
> 
> > 
> > It's the result of using same vq for guest to host and
> > host to guest communication, and I think it's not a great idea.
> > I'd reuse stats vq for host to guest requests maybe.
> > 
> 
> 
> As we discussed before, we can't have a vq interleave the report of stats
> and free pages.
> The vq will be locked when one command is in use. So, when live migration
> starts, the
> periodically reported stats will be delayed.






> Would this be OK? Or would you
> like to have
> one host to guest vq, and multiple host to guest vqs? That is,
> 
> - host to guest:
> CMD_VQ
> 
> - guest to host:
> STATS_REPORT_VQ
> FREE_PAGE_VQ
> 
> 
> Best,
> Wei
> 

Point is stats report vq is also host to guest.
So I think it can be combined with CMD VQ.
If it's too hard a separate vq isn't too bad though.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18 18:10         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:10 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
> > >   include/uapi/linux/virtio_balloon.h |   1 +
> > >   2 files changed, 147 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 72041b4..e6755bc 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
> > >   struct virtio_balloon {
> > >   	struct virtio_device *vdev;
> > > -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > >   	/* The balloon servicing is delegated to a freezable workqueue. */
> > >   	struct work_struct update_balloon_stats_work;
> > >   	struct work_struct update_balloon_size_work;
> > > +	struct work_struct report_free_page_work;
> > >   	/* Prevent updating balloon when it is being canceled. */
> > >   	spinlock_t stop_update_lock;
> > > @@ -90,6 +91,13 @@ struct virtio_balloon {
> > >   	/* Memory statistics */
> > >   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > +	/*
> > > +	 * Used by the device and driver to signal each other.
> > > +	 * device->driver: start the free page report.
> > > +	 * driver->device: end the free page report.
> > > +	 */
> > > +	__virtio32 report_free_page_signal;
> > > +
> > >   	/* To register callback in oom notifier call chain */
> > >   	struct notifier_block nb;
> > >   };
> > > @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
> > >   	} while (unlikely(ret == -ENOSPC));
> > >   }
> > > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > +{
> > > +	unsigned int len;
> > > +
> > > +	add_one_sg(vq, addr, size);
> > > +	virtqueue_kick(vq);
> > > +	/* Release entries if there are */
> > > +	while (virtqueue_get_buf(vq, &len))
> > > +		;
> > > +}
> > > +
> > >   /*
> > >    * Send balloon pages in sgs to host. The balloon pages are recorded in the
> > >    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > > @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
> > >   		queue_work(system_freezable_wq, work);
> > >   }
> > > +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> > > +					   unsigned long nr_pages)
> > > +{
> > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > +	void *addr = (void *)pfn_to_kaddr(pfn);
> > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > +
> > > +	send_free_page_sg(vb->free_page_vq, addr, len);
> > > +}
> > > +
> > > +static void report_free_page_completion(struct virtio_balloon *vb)
> > > +{
> > > +	struct virtqueue *vq = vb->free_page_vq;
> > > +	struct scatterlist sg;
> > > +	unsigned int len;
> > > +	int ret;
> > > +
> > > +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > > +}
> > So the annoying thing here is that once this starts going,
> > it will keep sending free pages from the list even if
> > host is no longer interested. There should be a way
> > for host to tell guest "stop" or "start from the beginning".
> 
> This can be achieved via two output signal buf here:
> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
> 
> The device holds both, and can put one of them to the vq and notify.

Do you mean device writes start and end in the buf? then it's an inbuf
not an outbuf.

> 
> 
> > 
> > It's the result of using same vq for guest to host and
> > host to guest communication, and I think it's not a great idea.
> > I'd reuse stats vq for host to guest requests maybe.
> > 
> 
> 
> As we discussed before, we can't have a vq interleave the report of stats
> and free pages.
> The vq will be locked when one command is in use. So, when live migration
> starts, the
> periodically reported stats will be delayed.






> Would this be OK? Or would you
> like to have
> one host to guest vq, and multiple host to guest vqs? That is,
> 
> - host to guest:
> CMD_VQ
> 
> - guest to host:
> STATS_REPORT_VQ
> FREE_PAGE_VQ
> 
> 
> Best,
> Wei
> 

Point is stats report vq is also host to guest.
So I think it can be combined with CMD VQ.
If it's too hard a separate vq isn't too bad though.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18 18:10         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:10 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
> > >   include/uapi/linux/virtio_balloon.h |   1 +
> > >   2 files changed, 147 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 72041b4..e6755bc 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
> > >   struct virtio_balloon {
> > >   	struct virtio_device *vdev;
> > > -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > >   	/* The balloon servicing is delegated to a freezable workqueue. */
> > >   	struct work_struct update_balloon_stats_work;
> > >   	struct work_struct update_balloon_size_work;
> > > +	struct work_struct report_free_page_work;
> > >   	/* Prevent updating balloon when it is being canceled. */
> > >   	spinlock_t stop_update_lock;
> > > @@ -90,6 +91,13 @@ struct virtio_balloon {
> > >   	/* Memory statistics */
> > >   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > +	/*
> > > +	 * Used by the device and driver to signal each other.
> > > +	 * device->driver: start the free page report.
> > > +	 * driver->device: end the free page report.
> > > +	 */
> > > +	__virtio32 report_free_page_signal;
> > > +
> > >   	/* To register callback in oom notifier call chain */
> > >   	struct notifier_block nb;
> > >   };
> > > @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
> > >   	} while (unlikely(ret == -ENOSPC));
> > >   }
> > > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > +{
> > > +	unsigned int len;
> > > +
> > > +	add_one_sg(vq, addr, size);
> > > +	virtqueue_kick(vq);
> > > +	/* Release entries if there are */
> > > +	while (virtqueue_get_buf(vq, &len))
> > > +		;
> > > +}
> > > +
> > >   /*
> > >    * Send balloon pages in sgs to host. The balloon pages are recorded in the
> > >    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > > @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
> > >   		queue_work(system_freezable_wq, work);
> > >   }
> > > +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> > > +					   unsigned long nr_pages)
> > > +{
> > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > +	void *addr = (void *)pfn_to_kaddr(pfn);
> > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > +
> > > +	send_free_page_sg(vb->free_page_vq, addr, len);
> > > +}
> > > +
> > > +static void report_free_page_completion(struct virtio_balloon *vb)
> > > +{
> > > +	struct virtqueue *vq = vb->free_page_vq;
> > > +	struct scatterlist sg;
> > > +	unsigned int len;
> > > +	int ret;
> > > +
> > > +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > > +}
> > So the annoying thing here is that once this starts going,
> > it will keep sending free pages from the list even if
> > host is no longer interested. There should be a way
> > for host to tell guest "stop" or "start from the beginning".
> 
> This can be achieved via two output signal buf here:
> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
> 
> The device holds both, and can put one of them to the vq and notify.

Do you mean device writes start and end in the buf? then it's an inbuf
not an outbuf.

> 
> 
> > 
> > It's the result of using same vq for guest to host and
> > host to guest communication, and I think it's not a great idea.
> > I'd reuse stats vq for host to guest requests maybe.
> > 
> 
> 
> As we discussed before, we can't have a vq interleave the report of stats
> and free pages.
> The vq will be locked when one command is in use. So, when live migration
> starts, the
> periodically reported stats will be delayed.






> Would this be OK? Or would you
> like to have
> one host to guest vq, and multiple host to guest vqs? That is,
> 
> - host to guest:
> CMD_VQ
> 
> - guest to host:
> STATS_REPORT_VQ
> FREE_PAGE_VQ
> 
> 
> Best,
> Wei
> 

Point is stats report vq is also host to guest.
So I think it can be combined with CMD VQ.
If it's too hard a separate vq isn't too bad though.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  8:36       ` Wei Wang
                         ` (3 preceding siblings ...)
  (?)
@ 2017-08-18 18:10       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:10 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
> > >   include/uapi/linux/virtio_balloon.h |   1 +
> > >   2 files changed, 147 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 72041b4..e6755bc 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
> > >   struct virtio_balloon {
> > >   	struct virtio_device *vdev;
> > > -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > >   	/* The balloon servicing is delegated to a freezable workqueue. */
> > >   	struct work_struct update_balloon_stats_work;
> > >   	struct work_struct update_balloon_size_work;
> > > +	struct work_struct report_free_page_work;
> > >   	/* Prevent updating balloon when it is being canceled. */
> > >   	spinlock_t stop_update_lock;
> > > @@ -90,6 +91,13 @@ struct virtio_balloon {
> > >   	/* Memory statistics */
> > >   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > +	/*
> > > +	 * Used by the device and driver to signal each other.
> > > +	 * device->driver: start the free page report.
> > > +	 * driver->device: end the free page report.
> > > +	 */
> > > +	__virtio32 report_free_page_signal;
> > > +
> > >   	/* To register callback in oom notifier call chain */
> > >   	struct notifier_block nb;
> > >   };
> > > @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
> > >   	} while (unlikely(ret == -ENOSPC));
> > >   }
> > > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > +{
> > > +	unsigned int len;
> > > +
> > > +	add_one_sg(vq, addr, size);
> > > +	virtqueue_kick(vq);
> > > +	/* Release entries if there are */
> > > +	while (virtqueue_get_buf(vq, &len))
> > > +		;
> > > +}
> > > +
> > >   /*
> > >    * Send balloon pages in sgs to host. The balloon pages are recorded in the
> > >    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > > @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
> > >   		queue_work(system_freezable_wq, work);
> > >   }
> > > +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> > > +					   unsigned long nr_pages)
> > > +{
> > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > +	void *addr = (void *)pfn_to_kaddr(pfn);
> > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > +
> > > +	send_free_page_sg(vb->free_page_vq, addr, len);
> > > +}
> > > +
> > > +static void report_free_page_completion(struct virtio_balloon *vb)
> > > +{
> > > +	struct virtqueue *vq = vb->free_page_vq;
> > > +	struct scatterlist sg;
> > > +	unsigned int len;
> > > +	int ret;
> > > +
> > > +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > > +}
> > So the annoying thing here is that once this starts going,
> > it will keep sending free pages from the list even if
> > host is no longer interested. There should be a way
> > for host to tell guest "stop" or "start from the beginning".
> 
> This can be achieved via two output signal buf here:
> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
> 
> The device holds both, and can put one of them to the vq and notify.

Do you mean device writes start and end in the buf? then it's an inbuf
not an outbuf.

> 
> 
> > 
> > It's the result of using same vq for guest to host and
> > host to guest communication, and I think it's not a great idea.
> > I'd reuse stats vq for host to guest requests maybe.
> > 
> 
> 
> As we discussed before, we can't have a vq interleave the report of stats
> and free pages.
> The vq will be locked when one command is in use. So, when live migration
> starts, the
> periodically reported stats will be delayed.






> Would this be OK? Or would you
> like to have
> one host to guest vq, and multiple host to guest vqs? That is,
> 
> - host to guest:
> CMD_VQ
> 
> - guest to host:
> STATS_REPORT_VQ
> FREE_PAGE_VQ
> 
> 
> Best,
> Wei
> 

Point is stats report vq is also host to guest.
So I think it can be combined with CMD VQ.
If it's too hard a separate vq isn't too bad though.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18 18:10         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:10 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >   drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
> > >   include/uapi/linux/virtio_balloon.h |   1 +
> > >   2 files changed, 147 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 72041b4..e6755bc 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
> > >   struct virtio_balloon {
> > >   	struct virtio_device *vdev;
> > > -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > >   	/* The balloon servicing is delegated to a freezable workqueue. */
> > >   	struct work_struct update_balloon_stats_work;
> > >   	struct work_struct update_balloon_size_work;
> > > +	struct work_struct report_free_page_work;
> > >   	/* Prevent updating balloon when it is being canceled. */
> > >   	spinlock_t stop_update_lock;
> > > @@ -90,6 +91,13 @@ struct virtio_balloon {
> > >   	/* Memory statistics */
> > >   	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > +	/*
> > > +	 * Used by the device and driver to signal each other.
> > > +	 * device->driver: start the free page report.
> > > +	 * driver->device: end the free page report.
> > > +	 */
> > > +	__virtio32 report_free_page_signal;
> > > +
> > >   	/* To register callback in oom notifier call chain */
> > >   	struct notifier_block nb;
> > >   };
> > > @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
> > >   	} while (unlikely(ret == -ENOSPC));
> > >   }
> > > +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> > > +{
> > > +	unsigned int len;
> > > +
> > > +	add_one_sg(vq, addr, size);
> > > +	virtqueue_kick(vq);
> > > +	/* Release entries if there are */
> > > +	while (virtqueue_get_buf(vq, &len))
> > > +		;
> > > +}
> > > +
> > >   /*
> > >    * Send balloon pages in sgs to host. The balloon pages are recorded in the
> > >    * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> > > @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
> > >   		queue_work(system_freezable_wq, work);
> > >   }
> > > +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> > > +					   unsigned long nr_pages)
> > > +{
> > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > +	void *addr = (void *)pfn_to_kaddr(pfn);
> > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > +
> > > +	send_free_page_sg(vb->free_page_vq, addr, len);
> > > +}
> > > +
> > > +static void report_free_page_completion(struct virtio_balloon *vb)
> > > +{
> > > +	struct virtqueue *vq = vb->free_page_vq;
> > > +	struct scatterlist sg;
> > > +	unsigned int len;
> > > +	int ret;
> > > +
> > > +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > > +}
> > So the annoying thing here is that once this starts going,
> > it will keep sending free pages from the list even if
> > host is no longer interested. There should be a way
> > for host to tell guest "stop" or "start from the beginning".
> 
> This can be achieved via two output signal buf here:
> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
> 
> The device holds both, and can put one of them to the vq and notify.

Do you mean device writes start and end in the buf? then it's an inbuf
not an outbuf.

> 
> 
> > 
> > It's the result of using same vq for guest to host and
> > host to guest communication, and I think it's not a great idea.
> > I'd reuse stats vq for host to guest requests maybe.
> > 
> 
> 
> As we discussed before, we can't have a vq interleave the report of stats
> and free pages.
> The vq will be locked when one command is in use. So, when live migration
> starts, the
> periodically reported stats will be delayed.






> Would this be OK? Or would you
> like to have
> one host to guest vq, and multiple host to guest vqs? That is,
> 
> - host to guest:
> CMD_VQ
> 
> - guest to host:
> STATS_REPORT_VQ
> FREE_PAGE_VQ
> 
> 
> Best,
> Wei
> 

Point is stats report vq is also host to guest.
So I think it can be combined with CMD VQ.
If it's too hard a separate vq isn't too bad though.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  8:41       ` Wei Wang
  (?)
  (?)
@ 2017-08-18 18:26         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:26 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > Please add some text here explaining the report_free_page_signal
> > thing.
> > 
> > 
> > I also really think we need some kind of ID in the
> > buffer to do a handshake. whenever id changes you
> > add another outbuf.
> 
> Please let me introduce the current design first:
> 1) device put the signal buf to the vq and notify the driver (we need
> a buffer because currently the device can't notify when the vq is empty);
> 
> 2) the driver starts the report of free page blocks via inbuf;
> 
> 3) the driver adds an the signal buf via outbuf to tell the device all are
> reported.
> 
> 
> Could you please elaborate more on the usage of ID?

While driver is free to maintain at most one buffer in flight
the design must work with pipelined requests as that
is important for performance.

So host might be able to request the reporting twice.
How does it know what is the report in response to?

If we put an id in request and in response, then that fixes it.


So there's a vq used for requesting free page reports.
driver does add_inbuf( &device->id).

Then when it starts reporting it does


add_outbuf(&device->id)

followed by pages.


Also if device->id changes it knows it should restart
reporting from beginning.






> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > what if there's another error?
> 
> Another error is -EIO, how about disabling the free page report feature?
> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
> 
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > what is this trickery doing? needs more comments or
> > a simplification.
> 
> Just this:
> if the vq is full, blocking wait till an entry gets released, then retry.
> This is the
> final one, which puts the signal buf to the vq to signify the end of the
> report and
> the mm lock is not held here, so it is fine to block.
> 

But why do you kick here on failure? I would understand it if you
did not kick when adding pages, as it is I don't understand.


Also pls rewrite this with a for or while loop for clarity.


> > 
> > 
> > > +}
> > > +
> > > +static void report_free_page(struct work_struct *work)
> > > +{
> > > +	struct virtio_balloon *vb;
> > > +
> > > +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> > > +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > That's a lot of work here. And system_wq documentation says:
> >   *
> >   * system_wq is the one used by schedule[_delayed]_work[_on]().
> >   * Multi-CPU multi-threaded.  There are users which expect relatively
> >   * short queue flush time.  Don't queue works which can run for too
> >   * long.
> > 
> > You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.
> 
> Thanks for the reminder. If not creating a new wq, how about
> system_unbound_wq?

I don't think that one's freezeable. 

> The first round of live migration needs the free pages, in that way we can
> have the
> pages reported to the hypervisor quicker.

The reason people call it *live* migration is because tasks keep
running. If you pin VCPUs with maintainance tasks it becomes pointless.

Maybe we need to set a special wq which will create idle
class threads. Does not seem to be supported but not hard to do.

> > 
> > > +	report_free_page_completion(vb);
> > So first you get list of pages, then an outbuf telling you
> > what they are in end of.  I think it's backwards.
> > Add an outbuf first followed by inbufs that tell you
> > what they are.
> 
> 
> If we have the signal filled with those flags like
> VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
> Probably not necessary to have an inbuf followed by an outbuf, right?
> 
> 
> Best,
> Wei

You really should document the messages in the commit log
and in the header.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18 18:26         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:26 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > Please add some text here explaining the report_free_page_signal
> > thing.
> > 
> > 
> > I also really think we need some kind of ID in the
> > buffer to do a handshake. whenever id changes you
> > add another outbuf.
> 
> Please let me introduce the current design first:
> 1) device put the signal buf to the vq and notify the driver (we need
> a buffer because currently the device can't notify when the vq is empty);
> 
> 2) the driver starts the report of free page blocks via inbuf;
> 
> 3) the driver adds an the signal buf via outbuf to tell the device all are
> reported.
> 
> 
> Could you please elaborate more on the usage of ID?

While driver is free to maintain at most one buffer in flight
the design must work with pipelined requests as that
is important for performance.

So host might be able to request the reporting twice.
How does it know what is the report in response to?

If we put an id in request and in response, then that fixes it.


So there's a vq used for requesting free page reports.
driver does add_inbuf( &device->id).

Then when it starts reporting it does


add_outbuf(&device->id)

followed by pages.


Also if device->id changes it knows it should restart
reporting from beginning.






> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > what if there's another error?
> 
> Another error is -EIO, how about disabling the free page report feature?
> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
> 
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > what is this trickery doing? needs more comments or
> > a simplification.
> 
> Just this:
> if the vq is full, blocking wait till an entry gets released, then retry.
> This is the
> final one, which puts the signal buf to the vq to signify the end of the
> report and
> the mm lock is not held here, so it is fine to block.
> 

But why do you kick here on failure? I would understand it if you
did not kick when adding pages, as it is I don't understand.


Also pls rewrite this with a for or while loop for clarity.


> > 
> > 
> > > +}
> > > +
> > > +static void report_free_page(struct work_struct *work)
> > > +{
> > > +	struct virtio_balloon *vb;
> > > +
> > > +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> > > +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > That's a lot of work here. And system_wq documentation says:
> >   *
> >   * system_wq is the one used by schedule[_delayed]_work[_on]().
> >   * Multi-CPU multi-threaded.  There are users which expect relatively
> >   * short queue flush time.  Don't queue works which can run for too
> >   * long.
> > 
> > You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.
> 
> Thanks for the reminder. If not creating a new wq, how about
> system_unbound_wq?

I don't think that one's freezeable. 

> The first round of live migration needs the free pages, in that way we can
> have the
> pages reported to the hypervisor quicker.

The reason people call it *live* migration is because tasks keep
running. If you pin VCPUs with maintainance tasks it becomes pointless.

Maybe we need to set a special wq which will create idle
class threads. Does not seem to be supported but not hard to do.

> > 
> > > +	report_free_page_completion(vb);
> > So first you get list of pages, then an outbuf telling you
> > what they are in end of.  I think it's backwards.
> > Add an outbuf first followed by inbufs that tell you
> > what they are.
> 
> 
> If we have the signal filled with those flags like
> VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
> Probably not necessary to have an inbuf followed by an outbuf, right?
> 
> 
> Best,
> Wei

You really should document the messages in the commit log
and in the header.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18 18:26         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:26 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > Please add some text here explaining the report_free_page_signal
> > thing.
> > 
> > 
> > I also really think we need some kind of ID in the
> > buffer to do a handshake. whenever id changes you
> > add another outbuf.
> 
> Please let me introduce the current design first:
> 1) device put the signal buf to the vq and notify the driver (we need
> a buffer because currently the device can't notify when the vq is empty);
> 
> 2) the driver starts the report of free page blocks via inbuf;
> 
> 3) the driver adds an the signal buf via outbuf to tell the device all are
> reported.
> 
> 
> Could you please elaborate more on the usage of ID?

While driver is free to maintain at most one buffer in flight
the design must work with pipelined requests as that
is important for performance.

So host might be able to request the reporting twice.
How does it know what is the report in response to?

If we put an id in request and in response, then that fixes it.


So there's a vq used for requesting free page reports.
driver does add_inbuf( &device->id).

Then when it starts reporting it does


add_outbuf(&device->id)

followed by pages.


Also if device->id changes it knows it should restart
reporting from beginning.






> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > what if there's another error?
> 
> Another error is -EIO, how about disabling the free page report feature?
> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
> 
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > what is this trickery doing? needs more comments or
> > a simplification.
> 
> Just this:
> if the vq is full, blocking wait till an entry gets released, then retry.
> This is the
> final one, which puts the signal buf to the vq to signify the end of the
> report and
> the mm lock is not held here, so it is fine to block.
> 

But why do you kick here on failure? I would understand it if you
did not kick when adding pages, as it is I don't understand.


Also pls rewrite this with a for or while loop for clarity.


> > 
> > 
> > > +}
> > > +
> > > +static void report_free_page(struct work_struct *work)
> > > +{
> > > +	struct virtio_balloon *vb;
> > > +
> > > +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> > > +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > That's a lot of work here. And system_wq documentation says:
> >   *
> >   * system_wq is the one used by schedule[_delayed]_work[_on]().
> >   * Multi-CPU multi-threaded.  There are users which expect relatively
> >   * short queue flush time.  Don't queue works which can run for too
> >   * long.
> > 
> > You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.
> 
> Thanks for the reminder. If not creating a new wq, how about
> system_unbound_wq?

I don't think that one's freezeable. 

> The first round of live migration needs the free pages, in that way we can
> have the
> pages reported to the hypervisor quicker.

The reason people call it *live* migration is because tasks keep
running. If you pin VCPUs with maintainance tasks it becomes pointless.

Maybe we need to set a special wq which will create idle
class threads. Does not seem to be supported but not hard to do.

> > 
> > > +	report_free_page_completion(vb);
> > So first you get list of pages, then an outbuf telling you
> > what they are in end of.  I think it's backwards.
> > Add an outbuf first followed by inbufs that tell you
> > what they are.
> 
> 
> If we have the signal filled with those flags like
> VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
> Probably not necessary to have an inbuf followed by an outbuf, right?
> 
> 
> Best,
> Wei

You really should document the messages in the commit log
and in the header.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18  8:41       ` Wei Wang
                         ` (3 preceding siblings ...)
  (?)
@ 2017-08-18 18:26       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:26 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > Please add some text here explaining the report_free_page_signal
> > thing.
> > 
> > 
> > I also really think we need some kind of ID in the
> > buffer to do a handshake. whenever id changes you
> > add another outbuf.
> 
> Please let me introduce the current design first:
> 1) device put the signal buf to the vq and notify the driver (we need
> a buffer because currently the device can't notify when the vq is empty);
> 
> 2) the driver starts the report of free page blocks via inbuf;
> 
> 3) the driver adds an the signal buf via outbuf to tell the device all are
> reported.
> 
> 
> Could you please elaborate more on the usage of ID?

While driver is free to maintain at most one buffer in flight
the design must work with pipelined requests as that
is important for performance.

So host might be able to request the reporting twice.
How does it know what is the report in response to?

If we put an id in request and in response, then that fixes it.


So there's a vq used for requesting free page reports.
driver does add_inbuf( &device->id).

Then when it starts reporting it does


add_outbuf(&device->id)

followed by pages.


Also if device->id changes it knows it should restart
reporting from beginning.






> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > what if there's another error?
> 
> Another error is -EIO, how about disabling the free page report feature?
> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
> 
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > what is this trickery doing? needs more comments or
> > a simplification.
> 
> Just this:
> if the vq is full, blocking wait till an entry gets released, then retry.
> This is the
> final one, which puts the signal buf to the vq to signify the end of the
> report and
> the mm lock is not held here, so it is fine to block.
> 

But why do you kick here on failure? I would understand it if you
did not kick when adding pages, as it is I don't understand.


Also pls rewrite this with a for or while loop for clarity.


> > 
> > 
> > > +}
> > > +
> > > +static void report_free_page(struct work_struct *work)
> > > +{
> > > +	struct virtio_balloon *vb;
> > > +
> > > +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> > > +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > That's a lot of work here. And system_wq documentation says:
> >   *
> >   * system_wq is the one used by schedule[_delayed]_work[_on]().
> >   * Multi-CPU multi-threaded.  There are users which expect relatively
> >   * short queue flush time.  Don't queue works which can run for too
> >   * long.
> > 
> > You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.
> 
> Thanks for the reminder. If not creating a new wq, how about
> system_unbound_wq?

I don't think that one's freezeable. 

> The first round of live migration needs the free pages, in that way we can
> have the
> pages reported to the hypervisor quicker.

The reason people call it *live* migration is because tasks keep
running. If you pin VCPUs with maintainance tasks it becomes pointless.

Maybe we need to set a special wq which will create idle
class threads. Does not seem to be supported but not hard to do.

> > 
> > > +	report_free_page_completion(vb);
> > So first you get list of pages, then an outbuf telling you
> > what they are in end of.  I think it's backwards.
> > Add an outbuf first followed by inbufs that tell you
> > what they are.
> 
> 
> If we have the signal filled with those flags like
> VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
> Probably not necessary to have an inbuf followed by an outbuf, right?
> 
> 
> Best,
> Wei

You really should document the messages in the commit log
and in the header.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-18 18:26         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-18 18:26 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
> > On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
> > > Add a new vq to report hints of guest free pages to the host.
> > Please add some text here explaining the report_free_page_signal
> > thing.
> > 
> > 
> > I also really think we need some kind of ID in the
> > buffer to do a handshake. whenever id changes you
> > add another outbuf.
> 
> Please let me introduce the current design first:
> 1) device put the signal buf to the vq and notify the driver (we need
> a buffer because currently the device can't notify when the vq is empty);
> 
> 2) the driver starts the report of free page blocks via inbuf;
> 
> 3) the driver adds an the signal buf via outbuf to tell the device all are
> reported.
> 
> 
> Could you please elaborate more on the usage of ID?

While driver is free to maintain at most one buffer in flight
the design must work with pipelined requests as that
is important for performance.

So host might be able to request the reporting twice.
How does it know what is the report in response to?

If we put an id in request and in response, then that fixes it.


So there's a vq used for requesting free page reports.
driver does add_inbuf( &device->id).

Then when it starts reporting it does


add_outbuf(&device->id)

followed by pages.


Also if device->id changes it knows it should restart
reporting from beginning.






> > > +retry:
> > > +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +	virtqueue_kick(vq);
> > > +	if (unlikely(ret == -ENOSPC)) {
> > what if there's another error?
> 
> Another error is -EIO, how about disabling the free page report feature?
> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
> 
> > > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		goto retry;
> > > +	}
> > what is this trickery doing? needs more comments or
> > a simplification.
> 
> Just this:
> if the vq is full, blocking wait till an entry gets released, then retry.
> This is the
> final one, which puts the signal buf to the vq to signify the end of the
> report and
> the mm lock is not held here, so it is fine to block.
> 

But why do you kick here on failure? I would understand it if you
did not kick when adding pages, as it is I don't understand.


Also pls rewrite this with a for or while loop for clarity.


> > 
> > 
> > > +}
> > > +
> > > +static void report_free_page(struct work_struct *work)
> > > +{
> > > +	struct virtio_balloon *vb;
> > > +
> > > +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> > > +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> > That's a lot of work here. And system_wq documentation says:
> >   *
> >   * system_wq is the one used by schedule[_delayed]_work[_on]().
> >   * Multi-CPU multi-threaded.  There are users which expect relatively
> >   * short queue flush time.  Don't queue works which can run for too
> >   * long.
> > 
> > You might want to create your own wq, maybe even with WQ_CPU_INTENSIVE.
> 
> Thanks for the reminder. If not creating a new wq, how about
> system_unbound_wq?

I don't think that one's freezeable. 

> The first round of live migration needs the free pages, in that way we can
> have the
> pages reported to the hypervisor quicker.

The reason people call it *live* migration is because tasks keep
running. If you pin VCPUs with maintainance tasks it becomes pointless.

Maybe we need to set a special wq which will create idle
class threads. Does not seem to be supported but not hard to do.

> > 
> > > +	report_free_page_completion(vb);
> > So first you get list of pages, then an outbuf telling you
> > what they are in end of.  I think it's backwards.
> > Add an outbuf first followed by inbufs that tell you
> > what they are.
> 
> 
> If we have the signal filled with those flags like
> VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START,
> Probably not necessary to have an inbuf followed by an outbuf, right?
> 
> 
> Best,
> Wei

You really should document the messages in the commit log
and in the header.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
  2017-08-17  3:26   ` Wei Wang
  (?)
@ 2017-08-19 20:30     ` kbuild test robot
  -1 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 20:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy, wei.w.wang,
	liliang.opensource, yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 1675 bytes --]

Hi Matthew,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   lib/xbitmap.c: In function 'xb_test_bit':
>> lib/xbitmap.c:153:26: warning: passing argument 1 of 'xb_bit_ops' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
     return (bool)xb_bit_ops(xb, bit, XB_TEST);
                             ^~
   lib/xbitmap.c:23:12: note: expected 'struct xb *' but argument is of type 'const struct xb *'
    static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
               ^~~~~~~~~~

vim +153 lib/xbitmap.c

   142	
   143	/**
   144	 * xb_test_bit - test a bit in the xbitmap
   145	 * @xb: the xbitmap tree used to record the bit
   146	 * @bit: index of the bit to set
   147	 *
   148	 * This function is used to test a bit in the xbitmap.
   149	 * Returns: 1 if the bit is set, or 0 otherwise.
   150	 */
   151	bool xb_test_bit(const struct xb *xb, unsigned long bit)
   152	{
 > 153		return (bool)xb_bit_ops(xb, bit, XB_TEST);
   154	}
   155	EXPORT_SYMBOL(xb_test_bit);
   156	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6665 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-08-19 20:30     ` kbuild test robot
  0 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 20:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 1675 bytes --]

Hi Matthew,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   lib/xbitmap.c: In function 'xb_test_bit':
>> lib/xbitmap.c:153:26: warning: passing argument 1 of 'xb_bit_ops' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
     return (bool)xb_bit_ops(xb, bit, XB_TEST);
                             ^~
   lib/xbitmap.c:23:12: note: expected 'struct xb *' but argument is of type 'const struct xb *'
    static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
               ^~~~~~~~~~

vim +153 lib/xbitmap.c

   142	
   143	/**
   144	 * xb_test_bit - test a bit in the xbitmap
   145	 * @xb: the xbitmap tree used to record the bit
   146	 * @bit: index of the bit to set
   147	 *
   148	 * This function is used to test a bit in the xbitmap.
   149	 * Returns: 1 if the bit is set, or 0 otherwise.
   150	 */
   151	bool xb_test_bit(const struct xb *xb, unsigned long bit)
   152	{
 > 153		return (bool)xb_bit_ops(xb, bit, XB_TEST);
   154	}
   155	EXPORT_SYMBOL(xb_test_bit);
   156	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6665 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-08-19 20:30     ` kbuild test robot
  0 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 20:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 1675 bytes --]

Hi Matthew,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   lib/xbitmap.c: In function 'xb_test_bit':
>> lib/xbitmap.c:153:26: warning: passing argument 1 of 'xb_bit_ops' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
     return (bool)xb_bit_ops(xb, bit, XB_TEST);
                             ^~
   lib/xbitmap.c:23:12: note: expected 'struct xb *' but argument is of type 'const struct xb *'
    static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
               ^~~~~~~~~~

vim +153 lib/xbitmap.c

   142	
   143	/**
   144	 * xb_test_bit - test a bit in the xbitmap
   145	 * @xb: the xbitmap tree used to record the bit
   146	 * @bit: index of the bit to set
   147	 *
   148	 * This function is used to test a bit in the xbitmap.
   149	 * Returns: 1 if the bit is set, or 0 otherwise.
   150	 */
   151	bool xb_test_bit(const struct xb *xb, unsigned long bit)
   152	{
 > 153		return (bool)xb_bit_ops(xb, bit, XB_TEST);
   154	}
   155	EXPORT_SYMBOL(xb_test_bit);
   156	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6665 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap
  2017-08-17  3:26   ` Wei Wang
                     ` (3 preceding siblings ...)
  (?)
@ 2017-08-19 20:30   ` kbuild test robot
  -1 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 20:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, kvm, mst, liliang.opensource, qemu-devel,
	virtualization, linux-mm, aarcange, virtio-dev, mawilcox, willy,
	quan.xu, cornelia.huck, mhocko, linux-kernel, kbuild-all,
	amit.shah, pbonzini, akpm, mgorman

[-- Attachment #1: Type: text/plain, Size: 1675 bytes --]

Hi Matthew,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   lib/xbitmap.c: In function 'xb_test_bit':
>> lib/xbitmap.c:153:26: warning: passing argument 1 of 'xb_bit_ops' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
     return (bool)xb_bit_ops(xb, bit, XB_TEST);
                             ^~
   lib/xbitmap.c:23:12: note: expected 'struct xb *' but argument is of type 'const struct xb *'
    static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
               ^~~~~~~~~~

vim +153 lib/xbitmap.c

   142	
   143	/**
   144	 * xb_test_bit - test a bit in the xbitmap
   145	 * @xb: the xbitmap tree used to record the bit
   146	 * @bit: index of the bit to set
   147	 *
   148	 * This function is used to test a bit in the xbitmap.
   149	 * Returns: 1 if the bit is set, or 0 otherwise.
   150	 */
   151	bool xb_test_bit(const struct xb *xb, unsigned long bit)
   152	{
 > 153		return (bool)xb_bit_ops(xb, bit, XB_TEST);
   154	}
   155	EXPORT_SYMBOL(xb_test_bit);
   156	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6665 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-17  3:26   ` Wei Wang
  (?)
@ 2017-08-19 21:37     ` kbuild test robot
  -1 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 21:37 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy, wei.w.wang,
	liliang.opensource, yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 2808 bytes --]

Hi Wei,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   drivers/virtio/virtio_balloon.c: In function 'tell_host_sgs':
>> drivers/virtio/virtio_balloon.c:203:3: error: implicit declaration of function 'pfn_to_kaddr' [-Werror=implicit-function-declaration]
      sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
      ^
   cc1: some warnings being treated as errors

vim +/pfn_to_kaddr +203 drivers/virtio/virtio_balloon.c

   176	
   177	/*
   178	 * Send balloon pages in sgs to host. The balloon pages are recorded in the
   179	 * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
   180	 * The page xbitmap is searched for continuous "1" bits, which correspond
   181	 * to continuous pages, to chunk into sgs.
   182	 *
   183	 * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
   184	 * need to be searched.
   185	 */
   186	static void tell_host_sgs(struct virtio_balloon *vb,
   187				  struct virtqueue *vq,
   188				  unsigned long page_xb_start,
   189				  unsigned long page_xb_end)
   190	{
   191		unsigned long sg_pfn_start, sg_pfn_end;
   192		void *sg_addr;
   193		uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
   194	
   195		sg_pfn_start = page_xb_start;
   196		while (sg_pfn_start < page_xb_end) {
   197			sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
   198							page_xb_end, 1);
   199			if (sg_pfn_start == page_xb_end + 1)
   200				break;
   201			sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
   202						      page_xb_end, 0);
 > 203			sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
   204			sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
   205			while (sg_len > sg_max_len) {
   206				send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
   207				sg_addr += sg_max_len;
   208				sg_len -= sg_max_len;
   209			}
   210			send_balloon_page_sg(vb, vq, sg_addr, sg_len);
   211			xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
   212			sg_pfn_start = sg_pfn_end + 1;
   213		}
   214	}
   215	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50926 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-19 21:37     ` kbuild test robot
  0 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 21:37 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 2808 bytes --]

Hi Wei,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   drivers/virtio/virtio_balloon.c: In function 'tell_host_sgs':
>> drivers/virtio/virtio_balloon.c:203:3: error: implicit declaration of function 'pfn_to_kaddr' [-Werror=implicit-function-declaration]
      sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
      ^
   cc1: some warnings being treated as errors

vim +/pfn_to_kaddr +203 drivers/virtio/virtio_balloon.c

   176	
   177	/*
   178	 * Send balloon pages in sgs to host. The balloon pages are recorded in the
   179	 * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
   180	 * The page xbitmap is searched for continuous "1" bits, which correspond
   181	 * to continuous pages, to chunk into sgs.
   182	 *
   183	 * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
   184	 * need to be searched.
   185	 */
   186	static void tell_host_sgs(struct virtio_balloon *vb,
   187				  struct virtqueue *vq,
   188				  unsigned long page_xb_start,
   189				  unsigned long page_xb_end)
   190	{
   191		unsigned long sg_pfn_start, sg_pfn_end;
   192		void *sg_addr;
   193		uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
   194	
   195		sg_pfn_start = page_xb_start;
   196		while (sg_pfn_start < page_xb_end) {
   197			sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
   198							page_xb_end, 1);
   199			if (sg_pfn_start == page_xb_end + 1)
   200				break;
   201			sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
   202						      page_xb_end, 0);
 > 203			sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
   204			sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
   205			while (sg_len > sg_max_len) {
   206				send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
   207				sg_addr += sg_max_len;
   208				sg_len -= sg_max_len;
   209			}
   210			send_balloon_page_sg(vb, vq, sg_addr, sg_len);
   211			xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
   212			sg_pfn_start = sg_pfn_end + 1;
   213		}
   214	}
   215	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50926 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-19 21:37     ` kbuild test robot
  0 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 21:37 UTC (permalink / raw)
  To: Wei Wang
  Cc: kbuild-all, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

[-- Attachment #1: Type: text/plain, Size: 2808 bytes --]

Hi Wei,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   drivers/virtio/virtio_balloon.c: In function 'tell_host_sgs':
>> drivers/virtio/virtio_balloon.c:203:3: error: implicit declaration of function 'pfn_to_kaddr' [-Werror=implicit-function-declaration]
      sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
      ^
   cc1: some warnings being treated as errors

vim +/pfn_to_kaddr +203 drivers/virtio/virtio_balloon.c

   176	
   177	/*
   178	 * Send balloon pages in sgs to host. The balloon pages are recorded in the
   179	 * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
   180	 * The page xbitmap is searched for continuous "1" bits, which correspond
   181	 * to continuous pages, to chunk into sgs.
   182	 *
   183	 * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
   184	 * need to be searched.
   185	 */
   186	static void tell_host_sgs(struct virtio_balloon *vb,
   187				  struct virtqueue *vq,
   188				  unsigned long page_xb_start,
   189				  unsigned long page_xb_end)
   190	{
   191		unsigned long sg_pfn_start, sg_pfn_end;
   192		void *sg_addr;
   193		uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
   194	
   195		sg_pfn_start = page_xb_start;
   196		while (sg_pfn_start < page_xb_end) {
   197			sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
   198							page_xb_end, 1);
   199			if (sg_pfn_start == page_xb_end + 1)
   200				break;
   201			sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
   202						      page_xb_end, 0);
 > 203			sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
   204			sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
   205			while (sg_len > sg_max_len) {
   206				send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
   207				sg_addr += sg_max_len;
   208				sg_len -= sg_max_len;
   209			}
   210			send_balloon_page_sg(vb, vq, sg_addr, sg_len);
   211			xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
   212			sg_pfn_start = sg_pfn_end + 1;
   213		}
   214	}
   215	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50926 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-17  3:26   ` Wei Wang
                     ` (4 preceding siblings ...)
  (?)
@ 2017-08-19 21:37   ` kbuild test robot
  -1 siblings, 0 replies; 116+ messages in thread
From: kbuild test robot @ 2017-08-19 21:37 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, kvm, mst, liliang.opensource, qemu-devel,
	virtualization, linux-mm, aarcange, virtio-dev, mawilcox, willy,
	quan.xu, cornelia.huck, mhocko, linux-kernel, kbuild-all,
	amit.shah, pbonzini, akpm, mgorman

[-- Attachment #1: Type: text/plain, Size: 2808 bytes --]

Hi Wei,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc5 next-20170817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wei-Wang/lib-xbitmap-Introduce-xbitmap/20170820-035516
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   drivers/virtio/virtio_balloon.c: In function 'tell_host_sgs':
>> drivers/virtio/virtio_balloon.c:203:3: error: implicit declaration of function 'pfn_to_kaddr' [-Werror=implicit-function-declaration]
      sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
      ^
   cc1: some warnings being treated as errors

vim +/pfn_to_kaddr +203 drivers/virtio/virtio_balloon.c

   176	
   177	/*
   178	 * Send balloon pages in sgs to host. The balloon pages are recorded in the
   179	 * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
   180	 * The page xbitmap is searched for continuous "1" bits, which correspond
   181	 * to continuous pages, to chunk into sgs.
   182	 *
   183	 * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
   184	 * need to be searched.
   185	 */
   186	static void tell_host_sgs(struct virtio_balloon *vb,
   187				  struct virtqueue *vq,
   188				  unsigned long page_xb_start,
   189				  unsigned long page_xb_end)
   190	{
   191		unsigned long sg_pfn_start, sg_pfn_end;
   192		void *sg_addr;
   193		uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
   194	
   195		sg_pfn_start = page_xb_start;
   196		while (sg_pfn_start < page_xb_end) {
   197			sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start,
   198							page_xb_end, 1);
   199			if (sg_pfn_start == page_xb_end + 1)
   200				break;
   201			sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1,
   202						      page_xb_end, 0);
 > 203			sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
   204			sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
   205			while (sg_len > sg_max_len) {
   206				send_balloon_page_sg(vb, vq, sg_addr, sg_max_len);
   207				sg_addr += sg_max_len;
   208				sg_len -= sg_max_len;
   209			}
   210			send_balloon_page_sg(vb, vq, sg_addr, sg_len);
   211			xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end);
   212			sg_pfn_start = sg_pfn_end + 1;
   213		}
   214	}
   215	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50926 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18 18:26         ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-08-21  5:21           ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:26 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>> Please add some text here explaining the report_free_page_signal
>>> thing.
>>>
>>>
>>> I also really think we need some kind of ID in the
>>> buffer to do a handshake. whenever id changes you
>>> add another outbuf.
>> Please let me introduce the current design first:
>> 1) device put the signal buf to the vq and notify the driver (we need
>> a buffer because currently the device can't notify when the vq is empty);
>>
>> 2) the driver starts the report of free page blocks via inbuf;
>>
>> 3) the driver adds an the signal buf via outbuf to tell the device all are
>> reported.
>>
>>
>> Could you please elaborate more on the usage of ID?
> While driver is free to maintain at most one buffer in flight
> the design must work with pipelined requests as that
> is important for performance.

How would the pipeline be designed?

Currently, once the report starts,
- the driver work: add_inbuf(free_pages) & kick;

- the device work:
     record the pages into a free page bitmap;
     virtqueue_push(elem);
     virtio_notify();

For the driver, as long as the vq has available entries, it keeps doing 
its work;
For the device, as long as there are free pages in the vq, it also keeps 
doing its work.


>
> So host might be able to request the reporting twice.
> How does it know what is the report in response to?

The request to start is sent when live migration starts, where would be
the second chance to send the request to start?



>
> If we put an id in request and in response, then that fixes it.
>
>
> So there's a vq used for requesting free page reports.
> driver does add_inbuf( &device->id).
>
> Then when it starts reporting it does
>
>
> add_outbuf(&device->id)
>
> followed by pages.
>
>
> Also if device->id changes it knows it should restart
> reporting from beginning.
>
>
>
>
>
>
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>> what if there's another error?
>> Another error is -EIO, how about disabling the free page report feature?
>> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
>>
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>> what is this trickery doing? needs more comments or
>>> a simplification.
>> Just this:
>> if the vq is full, blocking wait till an entry gets released, then retry.
>> This is the
>> final one, which puts the signal buf to the vq to signify the end of the
>> report and
>> the mm lock is not held here, so it is fine to block.
>>
> But why do you kick here on failure? I would understand it if you
> did not kick when adding pages, as it is I don't understand.
>
>
> Also pls rewrite this with a for or while loop for clarity.

OK, I will rewrite this part.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-21  5:21           ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:26 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>> Please add some text here explaining the report_free_page_signal
>>> thing.
>>>
>>>
>>> I also really think we need some kind of ID in the
>>> buffer to do a handshake. whenever id changes you
>>> add another outbuf.
>> Please let me introduce the current design first:
>> 1) device put the signal buf to the vq and notify the driver (we need
>> a buffer because currently the device can't notify when the vq is empty);
>>
>> 2) the driver starts the report of free page blocks via inbuf;
>>
>> 3) the driver adds an the signal buf via outbuf to tell the device all are
>> reported.
>>
>>
>> Could you please elaborate more on the usage of ID?
> While driver is free to maintain at most one buffer in flight
> the design must work with pipelined requests as that
> is important for performance.

How would the pipeline be designed?

Currently, once the report starts,
- the driver work: add_inbuf(free_pages) & kick;

- the device work:
     record the pages into a free page bitmap;
     virtqueue_push(elem);
     virtio_notify();

For the driver, as long as the vq has available entries, it keeps doing 
its work;
For the device, as long as there are free pages in the vq, it also keeps 
doing its work.


>
> So host might be able to request the reporting twice.
> How does it know what is the report in response to?

The request to start is sent when live migration starts, where would be
the second chance to send the request to start?



>
> If we put an id in request and in response, then that fixes it.
>
>
> So there's a vq used for requesting free page reports.
> driver does add_inbuf( &device->id).
>
> Then when it starts reporting it does
>
>
> add_outbuf(&device->id)
>
> followed by pages.
>
>
> Also if device->id changes it knows it should restart
> reporting from beginning.
>
>
>
>
>
>
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>> what if there's another error?
>> Another error is -EIO, how about disabling the free page report feature?
>> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
>>
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>> what is this trickery doing? needs more comments or
>>> a simplification.
>> Just this:
>> if the vq is full, blocking wait till an entry gets released, then retry.
>> This is the
>> final one, which puts the signal buf to the vq to signify the end of the
>> report and
>> the mm lock is not held here, so it is fine to block.
>>
> But why do you kick here on failure? I would understand it if you
> did not kick when adding pages, as it is I don't understand.
>
>
> Also pls rewrite this with a for or while loop for clarity.

OK, I will rewrite this part.


Best,
Wei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-21  5:21           ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:26 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>> Please add some text here explaining the report_free_page_signal
>>> thing.
>>>
>>>
>>> I also really think we need some kind of ID in the
>>> buffer to do a handshake. whenever id changes you
>>> add another outbuf.
>> Please let me introduce the current design first:
>> 1) device put the signal buf to the vq and notify the driver (we need
>> a buffer because currently the device can't notify when the vq is empty);
>>
>> 2) the driver starts the report of free page blocks via inbuf;
>>
>> 3) the driver adds an the signal buf via outbuf to tell the device all are
>> reported.
>>
>>
>> Could you please elaborate more on the usage of ID?
> While driver is free to maintain at most one buffer in flight
> the design must work with pipelined requests as that
> is important for performance.

How would the pipeline be designed?

Currently, once the report starts,
- the driver work: add_inbuf(free_pages) & kick;

- the device work:
     record the pages into a free page bitmap;
     virtqueue_push(elem);
     virtio_notify();

For the driver, as long as the vq has available entries, it keeps doing 
its work;
For the device, as long as there are free pages in the vq, it also keeps 
doing its work.


>
> So host might be able to request the reporting twice.
> How does it know what is the report in response to?

The request to start is sent when live migration starts, where would be
the second chance to send the request to start?



>
> If we put an id in request and in response, then that fixes it.
>
>
> So there's a vq used for requesting free page reports.
> driver does add_inbuf( &device->id).
>
> Then when it starts reporting it does
>
>
> add_outbuf(&device->id)
>
> followed by pages.
>
>
> Also if device->id changes it knows it should restart
> reporting from beginning.
>
>
>
>
>
>
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>> what if there's another error?
>> Another error is -EIO, how about disabling the free page report feature?
>> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
>>
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>> what is this trickery doing? needs more comments or
>>> a simplification.
>> Just this:
>> if the vq is full, blocking wait till an entry gets released, then retry.
>> This is the
>> final one, which puts the signal buf to the vq to signify the end of the
>> report and
>> the mm lock is not held here, so it is fine to block.
>>
> But why do you kick here on failure? I would understand it if you
> did not kick when adding pages, as it is I don't understand.
>
>
> Also pls rewrite this with a for or while loop for clarity.

OK, I will rewrite this part.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18 18:26         ` Michael S. Tsirkin
                           ` (3 preceding siblings ...)
  (?)
@ 2017-08-21  5:21         ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On 08/19/2017 02:26 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>> Please add some text here explaining the report_free_page_signal
>>> thing.
>>>
>>>
>>> I also really think we need some kind of ID in the
>>> buffer to do a handshake. whenever id changes you
>>> add another outbuf.
>> Please let me introduce the current design first:
>> 1) device put the signal buf to the vq and notify the driver (we need
>> a buffer because currently the device can't notify when the vq is empty);
>>
>> 2) the driver starts the report of free page blocks via inbuf;
>>
>> 3) the driver adds an the signal buf via outbuf to tell the device all are
>> reported.
>>
>>
>> Could you please elaborate more on the usage of ID?
> While driver is free to maintain at most one buffer in flight
> the design must work with pipelined requests as that
> is important for performance.

How would the pipeline be designed?

Currently, once the report starts,
- the driver work: add_inbuf(free_pages) & kick;

- the device work:
     record the pages into a free page bitmap;
     virtqueue_push(elem);
     virtio_notify();

For the driver, as long as the vq has available entries, it keeps doing 
its work;
For the device, as long as there are free pages in the vq, it also keeps 
doing its work.


>
> So host might be able to request the reporting twice.
> How does it know what is the report in response to?

The request to start is sent when live migration starts, where would be
the second chance to send the request to start?



>
> If we put an id in request and in response, then that fixes it.
>
>
> So there's a vq used for requesting free page reports.
> driver does add_inbuf( &device->id).
>
> Then when it starts reporting it does
>
>
> add_outbuf(&device->id)
>
> followed by pages.
>
>
> Also if device->id changes it knows it should restart
> reporting from beginning.
>
>
>
>
>
>
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>> what if there's another error?
>> Another error is -EIO, how about disabling the free page report feature?
>> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
>>
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>> what is this trickery doing? needs more comments or
>>> a simplification.
>> Just this:
>> if the vq is full, blocking wait till an entry gets released, then retry.
>> This is the
>> final one, which puts the signal buf to the vq to signify the end of the
>> report and
>> the mm lock is not held here, so it is fine to block.
>>
> But why do you kick here on failure? I would understand it if you
> did not kick when adding pages, as it is I don't understand.
>
>
> Also pls rewrite this with a for or while loop for clarity.

OK, I will rewrite this part.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-21  5:21           ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:26 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:41:41PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>> Please add some text here explaining the report_free_page_signal
>>> thing.
>>>
>>>
>>> I also really think we need some kind of ID in the
>>> buffer to do a handshake. whenever id changes you
>>> add another outbuf.
>> Please let me introduce the current design first:
>> 1) device put the signal buf to the vq and notify the driver (we need
>> a buffer because currently the device can't notify when the vq is empty);
>>
>> 2) the driver starts the report of free page blocks via inbuf;
>>
>> 3) the driver adds an the signal buf via outbuf to tell the device all are
>> reported.
>>
>>
>> Could you please elaborate more on the usage of ID?
> While driver is free to maintain at most one buffer in flight
> the design must work with pipelined requests as that
> is important for performance.

How would the pipeline be designed?

Currently, once the report starts,
- the driver work: add_inbuf(free_pages) & kick;

- the device work:
     record the pages into a free page bitmap;
     virtqueue_push(elem);
     virtio_notify();

For the driver, as long as the vq has available entries, it keeps doing 
its work;
For the device, as long as there are free pages in the vq, it also keeps 
doing its work.


>
> So host might be able to request the reporting twice.
> How does it know what is the report in response to?

The request to start is sent when live migration starts, where would be
the second chance to send the request to start?



>
> If we put an id in request and in response, then that fixes it.
>
>
> So there's a vq used for requesting free page reports.
> driver does add_inbuf( &device->id).
>
> Then when it starts reporting it does
>
>
> add_outbuf(&device->id)
>
> followed by pages.
>
>
> Also if device->id changes it knows it should restart
> reporting from beginning.
>
>
>
>
>
>
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>> what if there's another error?
>> Another error is -EIO, how about disabling the free page report feature?
>> (I also saw it isn't handled in many other virtio devices e.g. virtio-net)
>>
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>> what is this trickery doing? needs more comments or
>>> a simplification.
>> Just this:
>> if the vq is full, blocking wait till an entry gets released, then retry.
>> This is the
>> final one, which puts the signal buf to the vq to signify the end of the
>> report and
>> the mm lock is not held here, so it is fine to block.
>>
> But why do you kick here on failure? I would understand it if you
> did not kick when adding pages, as it is I don't understand.
>
>
> Also pls rewrite this with a for or while loop for clarity.

OK, I will rewrite this part.


Best,
Wei

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18 18:10         ` Michael S. Tsirkin
                             ` (2 preceding siblings ...)
  (?)
@ 2017-08-21  5:28           ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:10 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>>>    include/uapi/linux/virtio_balloon.h |   1 +
>>>>    2 files changed, 147 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 72041b4..e6755bc 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>>>    struct virtio_balloon {
>>>>    	struct virtio_device *vdev;
>>>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>>>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>>>    	/* The balloon servicing is delegated to a freezable workqueue. */
>>>>    	struct work_struct update_balloon_stats_work;
>>>>    	struct work_struct update_balloon_size_work;
>>>> +	struct work_struct report_free_page_work;
>>>>    	/* Prevent updating balloon when it is being canceled. */
>>>>    	spinlock_t stop_update_lock;
>>>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>>>    	/* Memory statistics */
>>>>    	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>>> +	/*
>>>> +	 * Used by the device and driver to signal each other.
>>>> +	 * device->driver: start the free page report.
>>>> +	 * driver->device: end the free page report.
>>>> +	 */
>>>> +	__virtio32 report_free_page_signal;
>>>> +
>>>>    	/* To register callback in oom notifier call chain */
>>>>    	struct notifier_block nb;
>>>>    };
>>>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>>    	} while (unlikely(ret == -ENOSPC));
>>>>    }
>>>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> +{
>>>> +	unsigned int len;
>>>> +
>>>> +	add_one_sg(vq, addr, size);
>>>> +	virtqueue_kick(vq);
>>>> +	/* Release entries if there are */
>>>> +	while (virtqueue_get_buf(vq, &len))
>>>> +		;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>>     * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>>>    		queue_work(system_freezable_wq, work);
>>>>    }
>>>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>>>> +					   unsigned long nr_pages)
>>>> +{
>>>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>>>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>>>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>>>> +
>>>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>>>> +}
>>>> +
>>>> +static void report_free_page_completion(struct virtio_balloon *vb)
>>>> +{
>>>> +	struct virtqueue *vq = vb->free_page_vq;
>>>> +	struct scatterlist sg;
>>>> +	unsigned int len;
>>>> +	int ret;
>>>> +
>>>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>>> +}
>>> So the annoying thing here is that once this starts going,
>>> it will keep sending free pages from the list even if
>>> host is no longer interested. There should be a way
>>> for host to tell guest "stop" or "start from the beginning".
>> This can be achieved via two output signal buf here:
>> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
>> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
>>
>> The device holds both, and can put one of them to the vq and notify.
> Do you mean device writes start and end in the buf? then it's an inbuf
> not an outbuf.
>

Not really. I meant that the driver fills two signal buffer,_START and _STOP
and send them as outbuf to the device. Then the device holds two read-only
signal buffer:
When request to start: add the _START elem to the vq
When request  to stop: add the _STOP elem to the vq


>>
>>> It's the result of using same vq for guest to host and
>>> host to guest communication, and I think it's not a great idea.
>>> I'd reuse stats vq for host to guest requests maybe.
>>>
>>
>> As we discussed before, we can't have a vq interleave the report of stats
>> and free pages.
>> The vq will be locked when one command is in use. So, when live migration
>> starts, the
>> periodically reported stats will be delayed.
>
>
>
>
>
>> Would this be OK? Or would you
>> like to have
>> one host to guest vq, and multiple host to guest vqs? That is,
>>
>> - host to guest:
>> CMD_VQ
>>
>> - guest to host:
>> STATS_REPORT_VQ
>> FREE_PAGE_VQ
>>
>>
>> Best,
>> Wei
>>
> Point is stats report vq is also host to guest.
> So I think it can be combined with CMD VQ.
> If it's too hard a separate vq isn't too bad though.
>

IMHO, this kind of categorization - using stat_vq for

host-to-guest stats request,
guest-to-host stats report,
host-to-guest free page request,

seems a little tricky and unclear. I would think it is better to have 
unrelated
features decoupled. For example, both stats report and free page report
are optional features. With the above implementation, using the free page
feature will depend on the stats report feature (in fact we can implement
it to have the free page feature work independently)

That being said, if you don't mind the above, we can go with that option
in the next version.

Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-21  5:28           ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:10 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>>>    include/uapi/linux/virtio_balloon.h |   1 +
>>>>    2 files changed, 147 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 72041b4..e6755bc 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>>>    struct virtio_balloon {
>>>>    	struct virtio_device *vdev;
>>>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>>>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>>>    	/* The balloon servicing is delegated to a freezable workqueue. */
>>>>    	struct work_struct update_balloon_stats_work;
>>>>    	struct work_struct update_balloon_size_work;
>>>> +	struct work_struct report_free_page_work;
>>>>    	/* Prevent updating balloon when it is being canceled. */
>>>>    	spinlock_t stop_update_lock;
>>>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>>>    	/* Memory statistics */
>>>>    	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>>> +	/*
>>>> +	 * Used by the device and driver to signal each other.
>>>> +	 * device->driver: start the free page report.
>>>> +	 * driver->device: end the free page report.
>>>> +	 */
>>>> +	__virtio32 report_free_page_signal;
>>>> +
>>>>    	/* To register callback in oom notifier call chain */
>>>>    	struct notifier_block nb;
>>>>    };
>>>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>>    	} while (unlikely(ret == -ENOSPC));
>>>>    }
>>>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> +{
>>>> +	unsigned int len;
>>>> +
>>>> +	add_one_sg(vq, addr, size);
>>>> +	virtqueue_kick(vq);
>>>> +	/* Release entries if there are */
>>>> +	while (virtqueue_get_buf(vq, &len))
>>>> +		;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>>     * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>>>    		queue_work(system_freezable_wq, work);
>>>>    }
>>>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>>>> +					   unsigned long nr_pages)
>>>> +{
>>>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>>>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>>>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>>>> +
>>>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>>>> +}
>>>> +
>>>> +static void report_free_page_completion(struct virtio_balloon *vb)
>>>> +{
>>>> +	struct virtqueue *vq = vb->free_page_vq;
>>>> +	struct scatterlist sg;
>>>> +	unsigned int len;
>>>> +	int ret;
>>>> +
>>>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>>> +}
>>> So the annoying thing here is that once this starts going,
>>> it will keep sending free pages from the list even if
>>> host is no longer interested. There should be a way
>>> for host to tell guest "stop" or "start from the beginning".
>> This can be achieved via two output signal buf here:
>> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
>> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
>>
>> The device holds both, and can put one of them to the vq and notify.
> Do you mean device writes start and end in the buf? then it's an inbuf
> not an outbuf.
>

Not really. I meant that the driver fills two signal buffer,_START and _STOP
and send them as outbuf to the device. Then the device holds two read-only
signal buffer:
When request to start: add the _START elem to the vq
When request  to stop: add the _STOP elem to the vq


>>
>>> It's the result of using same vq for guest to host and
>>> host to guest communication, and I think it's not a great idea.
>>> I'd reuse stats vq for host to guest requests maybe.
>>>
>>
>> As we discussed before, we can't have a vq interleave the report of stats
>> and free pages.
>> The vq will be locked when one command is in use. So, when live migration
>> starts, the
>> periodically reported stats will be delayed.
>
>
>
>
>
>> Would this be OK? Or would you
>> like to have
>> one host to guest vq, and multiple host to guest vqs? That is,
>>
>> - host to guest:
>> CMD_VQ
>>
>> - guest to host:
>> STATS_REPORT_VQ
>> FREE_PAGE_VQ
>>
>>
>> Best,
>> Wei
>>
> Point is stats report vq is also host to guest.
> So I think it can be combined with CMD VQ.
> If it's too hard a separate vq isn't too bad though.
>

IMHO, this kind of categorization - using stat_vq for

host-to-guest stats request,
guest-to-host stats report,
host-to-guest free page request,

seems a little tricky and unclear. I would think it is better to have 
unrelated
features decoupled. For example, both stats report and free page report
are optional features. With the above implementation, using the free page
feature will depend on the stats report feature (in fact we can implement
it to have the free page feature work independently)

That being said, if you don't mind the above, we can go with that option
in the next version.

Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-21  5:28           ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:10 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>>>    include/uapi/linux/virtio_balloon.h |   1 +
>>>>    2 files changed, 147 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 72041b4..e6755bc 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>>>    struct virtio_balloon {
>>>>    	struct virtio_device *vdev;
>>>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>>>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>>>    	/* The balloon servicing is delegated to a freezable workqueue. */
>>>>    	struct work_struct update_balloon_stats_work;
>>>>    	struct work_struct update_balloon_size_work;
>>>> +	struct work_struct report_free_page_work;
>>>>    	/* Prevent updating balloon when it is being canceled. */
>>>>    	spinlock_t stop_update_lock;
>>>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>>>    	/* Memory statistics */
>>>>    	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>>> +	/*
>>>> +	 * Used by the device and driver to signal each other.
>>>> +	 * device->driver: start the free page report.
>>>> +	 * driver->device: end the free page report.
>>>> +	 */
>>>> +	__virtio32 report_free_page_signal;
>>>> +
>>>>    	/* To register callback in oom notifier call chain */
>>>>    	struct notifier_block nb;
>>>>    };
>>>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>>    	} while (unlikely(ret == -ENOSPC));
>>>>    }
>>>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> +{
>>>> +	unsigned int len;
>>>> +
>>>> +	add_one_sg(vq, addr, size);
>>>> +	virtqueue_kick(vq);
>>>> +	/* Release entries if there are */
>>>> +	while (virtqueue_get_buf(vq, &len))
>>>> +		;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>>     * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>>>    		queue_work(system_freezable_wq, work);
>>>>    }
>>>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>>>> +					   unsigned long nr_pages)
>>>> +{
>>>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>>>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>>>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>>>> +
>>>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>>>> +}
>>>> +
>>>> +static void report_free_page_completion(struct virtio_balloon *vb)
>>>> +{
>>>> +	struct virtqueue *vq = vb->free_page_vq;
>>>> +	struct scatterlist sg;
>>>> +	unsigned int len;
>>>> +	int ret;
>>>> +
>>>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>>> +}
>>> So the annoying thing here is that once this starts going,
>>> it will keep sending free pages from the list even if
>>> host is no longer interested. There should be a way
>>> for host to tell guest "stop" or "start from the beginning".
>> This can be achieved via two output signal buf here:
>> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
>> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
>>
>> The device holds both, and can put one of them to the vq and notify.
> Do you mean device writes start and end in the buf? then it's an inbuf
> not an outbuf.
>

Not really. I meant that the driver fills two signal buffer,_START and _STOP
and send them as outbuf to the device. Then the device holds two read-only
signal buffer:
When request to start: add the _START elem to the vq
When request  to stop: add the _STOP elem to the vq


>>
>>> It's the result of using same vq for guest to host and
>>> host to guest communication, and I think it's not a great idea.
>>> I'd reuse stats vq for host to guest requests maybe.
>>>
>>
>> As we discussed before, we can't have a vq interleave the report of stats
>> and free pages.
>> The vq will be locked when one command is in use. So, when live migration
>> starts, the
>> periodically reported stats will be delayed.
>
>
>
>
>
>> Would this be OK? Or would you
>> like to have
>> one host to guest vq, and multiple host to guest vqs? That is,
>>
>> - host to guest:
>> CMD_VQ
>>
>> - guest to host:
>> STATS_REPORT_VQ
>> FREE_PAGE_VQ
>>
>>
>> Best,
>> Wei
>>
> Point is stats report vq is also host to guest.
> So I think it can be combined with CMD VQ.
> If it's too hard a separate vq isn't too bad though.
>

IMHO, this kind of categorization - using stat_vq for

host-to-guest stats request,
guest-to-host stats report,
host-to-guest free page request,

seems a little tricky and unclear. I would think it is better to have 
unrelated
features decoupled. For example, both stats report and free page report
are optional features. With the above implementation, using the free page
feature will depend on the stats report feature (in fact we can implement
it to have the free page feature work independently)

That being said, if you don't mind the above, we can go with that option
in the next version.

Best,
Wei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-21  5:28           ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:10 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>>>    include/uapi/linux/virtio_balloon.h |   1 +
>>>>    2 files changed, 147 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 72041b4..e6755bc 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>>>    struct virtio_balloon {
>>>>    	struct virtio_device *vdev;
>>>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>>>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>>>    	/* The balloon servicing is delegated to a freezable workqueue. */
>>>>    	struct work_struct update_balloon_stats_work;
>>>>    	struct work_struct update_balloon_size_work;
>>>> +	struct work_struct report_free_page_work;
>>>>    	/* Prevent updating balloon when it is being canceled. */
>>>>    	spinlock_t stop_update_lock;
>>>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>>>    	/* Memory statistics */
>>>>    	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>>> +	/*
>>>> +	 * Used by the device and driver to signal each other.
>>>> +	 * device->driver: start the free page report.
>>>> +	 * driver->device: end the free page report.
>>>> +	 */
>>>> +	__virtio32 report_free_page_signal;
>>>> +
>>>>    	/* To register callback in oom notifier call chain */
>>>>    	struct notifier_block nb;
>>>>    };
>>>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>>    	} while (unlikely(ret == -ENOSPC));
>>>>    }
>>>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> +{
>>>> +	unsigned int len;
>>>> +
>>>> +	add_one_sg(vq, addr, size);
>>>> +	virtqueue_kick(vq);
>>>> +	/* Release entries if there are */
>>>> +	while (virtqueue_get_buf(vq, &len))
>>>> +		;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>>     * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>>>    		queue_work(system_freezable_wq, work);
>>>>    }
>>>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>>>> +					   unsigned long nr_pages)
>>>> +{
>>>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>>>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>>>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>>>> +
>>>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>>>> +}
>>>> +
>>>> +static void report_free_page_completion(struct virtio_balloon *vb)
>>>> +{
>>>> +	struct virtqueue *vq = vb->free_page_vq;
>>>> +	struct scatterlist sg;
>>>> +	unsigned int len;
>>>> +	int ret;
>>>> +
>>>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>>> +}
>>> So the annoying thing here is that once this starts going,
>>> it will keep sending free pages from the list even if
>>> host is no longer interested. There should be a way
>>> for host to tell guest "stop" or "start from the beginning".
>> This can be achieved via two output signal buf here:
>> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
>> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
>>
>> The device holds both, and can put one of them to the vq and notify.
> Do you mean device writes start and end in the buf? then it's an inbuf
> not an outbuf.
>

Not really. I meant that the driver fills two signal buffer,_START and _STOP
and send them as outbuf to the device. Then the device holds two read-only
signal buffer:
When request to start: add the _START elem to the vq
When request  to stop: add the _STOP elem to the vq


>>
>>> It's the result of using same vq for guest to host and
>>> host to guest communication, and I think it's not a great idea.
>>> I'd reuse stats vq for host to guest requests maybe.
>>>
>>
>> As we discussed before, we can't have a vq interleave the report of stats
>> and free pages.
>> The vq will be locked when one command is in use. So, when live migration
>> starts, the
>> periodically reported stats will be delayed.
>
>
>
>
>
>> Would this be OK? Or would you
>> like to have
>> one host to guest vq, and multiple host to guest vqs? That is,
>>
>> - host to guest:
>> CMD_VQ
>>
>> - guest to host:
>> STATS_REPORT_VQ
>> FREE_PAGE_VQ
>>
>>
>> Best,
>> Wei
>>
> Point is stats report vq is also host to guest.
> So I think it can be combined with CMD VQ.
> If it's too hard a separate vq isn't too bad though.
>

IMHO, this kind of categorization - using stat_vq for

host-to-guest stats request,
guest-to-host stats report,
host-to-guest free page request,

seems a little tricky and unclear. I would think it is better to have 
unrelated
features decoupled. For example, both stats report and free page report
are optional features. With the above implementation, using the free page
feature will depend on the stats report feature (in fact we can implement
it to have the free page feature work independently)

That being said, if you don't mind the above, we can go with that option
in the next version.

Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
  2017-08-18 18:10         ` Michael S. Tsirkin
                           ` (2 preceding siblings ...)
  (?)
@ 2017-08-21  5:28         ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On 08/19/2017 02:10 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>>>    include/uapi/linux/virtio_balloon.h |   1 +
>>>>    2 files changed, 147 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 72041b4..e6755bc 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>>>    struct virtio_balloon {
>>>>    	struct virtio_device *vdev;
>>>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>>>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>>>    	/* The balloon servicing is delegated to a freezable workqueue. */
>>>>    	struct work_struct update_balloon_stats_work;
>>>>    	struct work_struct update_balloon_size_work;
>>>> +	struct work_struct report_free_page_work;
>>>>    	/* Prevent updating balloon when it is being canceled. */
>>>>    	spinlock_t stop_update_lock;
>>>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>>>    	/* Memory statistics */
>>>>    	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>>> +	/*
>>>> +	 * Used by the device and driver to signal each other.
>>>> +	 * device->driver: start the free page report.
>>>> +	 * driver->device: end the free page report.
>>>> +	 */
>>>> +	__virtio32 report_free_page_signal;
>>>> +
>>>>    	/* To register callback in oom notifier call chain */
>>>>    	struct notifier_block nb;
>>>>    };
>>>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>>    	} while (unlikely(ret == -ENOSPC));
>>>>    }
>>>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> +{
>>>> +	unsigned int len;
>>>> +
>>>> +	add_one_sg(vq, addr, size);
>>>> +	virtqueue_kick(vq);
>>>> +	/* Release entries if there are */
>>>> +	while (virtqueue_get_buf(vq, &len))
>>>> +		;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>>     * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>>>    		queue_work(system_freezable_wq, work);
>>>>    }
>>>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>>>> +					   unsigned long nr_pages)
>>>> +{
>>>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>>>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>>>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>>>> +
>>>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>>>> +}
>>>> +
>>>> +static void report_free_page_completion(struct virtio_balloon *vb)
>>>> +{
>>>> +	struct virtqueue *vq = vb->free_page_vq;
>>>> +	struct scatterlist sg;
>>>> +	unsigned int len;
>>>> +	int ret;
>>>> +
>>>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>>> +}
>>> So the annoying thing here is that once this starts going,
>>> it will keep sending free pages from the list even if
>>> host is no longer interested. There should be a way
>>> for host to tell guest "stop" or "start from the beginning".
>> This can be achieved via two output signal buf here:
>> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
>> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
>>
>> The device holds both, and can put one of them to the vq and notify.
> Do you mean device writes start and end in the buf? then it's an inbuf
> not an outbuf.
>

Not really. I meant that the driver fills two signal buffer,_START and _STOP
and send them as outbuf to the device. Then the device holds two read-only
signal buffer:
When request to start: add the _START elem to the vq
When request  to stop: add the _STOP elem to the vq


>>
>>> It's the result of using same vq for guest to host and
>>> host to guest communication, and I think it's not a great idea.
>>> I'd reuse stats vq for host to guest requests maybe.
>>>
>>
>> As we discussed before, we can't have a vq interleave the report of stats
>> and free pages.
>> The vq will be locked when one command is in use. So, when live migration
>> starts, the
>> periodically reported stats will be delayed.
>
>
>
>
>
>> Would this be OK? Or would you
>> like to have
>> one host to guest vq, and multiple host to guest vqs? That is,
>>
>> - host to guest:
>> CMD_VQ
>>
>> - guest to host:
>> STATS_REPORT_VQ
>> FREE_PAGE_VQ
>>
>>
>> Best,
>> Wei
>>
> Point is stats report vq is also host to guest.
> So I think it can be combined with CMD VQ.
> If it's too hard a separate vq isn't too bad though.
>

IMHO, this kind of categorization - using stat_vq for

host-to-guest stats request,
guest-to-host stats report,
host-to-guest free page request,

seems a little tricky and unclear. I would think it is better to have 
unrelated
features decoupled. For example, both stats report and free page report
are optional features. With the above implementation, using the free page
feature will depend on the stats report feature (in fact we can implement
it to have the free page feature work independently)

That being said, if you don't mind the above, we can go with that option
in the next version.

Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [virtio-dev] Re: [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
@ 2017-08-21  5:28           ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  5:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/19/2017 02:10 AM, Michael S. Tsirkin wrote:
> On Fri, Aug 18, 2017 at 04:36:06PM +0800, Wei Wang wrote:
>> On 08/18/2017 10:28 AM, Michael S. Tsirkin wrote:
>>> On Thu, Aug 17, 2017 at 11:26:56AM +0800, Wei Wang wrote:
>>>> Add a new vq to report hints of guest free pages to the host.
>>>>
>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>>>> ---
>>>>    drivers/virtio/virtio_balloon.c     | 167 +++++++++++++++++++++++++++++++-----
>>>>    include/uapi/linux/virtio_balloon.h |   1 +
>>>>    2 files changed, 147 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>>>> index 72041b4..e6755bc 100644
>>>> --- a/drivers/virtio/virtio_balloon.c
>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>> @@ -54,11 +54,12 @@ static struct vfsmount *balloon_mnt;
>>>>    struct virtio_balloon {
>>>>    	struct virtio_device *vdev;
>>>> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>>>> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>>>>    	/* The balloon servicing is delegated to a freezable workqueue. */
>>>>    	struct work_struct update_balloon_stats_work;
>>>>    	struct work_struct update_balloon_size_work;
>>>> +	struct work_struct report_free_page_work;
>>>>    	/* Prevent updating balloon when it is being canceled. */
>>>>    	spinlock_t stop_update_lock;
>>>> @@ -90,6 +91,13 @@ struct virtio_balloon {
>>>>    	/* Memory statistics */
>>>>    	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>>>> +	/*
>>>> +	 * Used by the device and driver to signal each other.
>>>> +	 * device->driver: start the free page report.
>>>> +	 * driver->device: end the free page report.
>>>> +	 */
>>>> +	__virtio32 report_free_page_signal;
>>>> +
>>>>    	/* To register callback in oom notifier call chain */
>>>>    	struct notifier_block nb;
>>>>    };
>>>> @@ -174,6 +182,17 @@ static void send_balloon_page_sg(struct virtio_balloon *vb,
>>>>    	} while (unlikely(ret == -ENOSPC));
>>>>    }
>>>> +static void send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
>>>> +{
>>>> +	unsigned int len;
>>>> +
>>>> +	add_one_sg(vq, addr, size);
>>>> +	virtqueue_kick(vq);
>>>> +	/* Release entries if there are */
>>>> +	while (virtqueue_get_buf(vq, &len))
>>>> +		;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Send balloon pages in sgs to host. The balloon pages are recorded in the
>>>>     * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
>>>> @@ -511,42 +530,143 @@ static void update_balloon_size_func(struct work_struct *work)
>>>>    		queue_work(system_freezable_wq, work);
>>>>    }
>>>> +static void virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
>>>> +					   unsigned long nr_pages)
>>>> +{
>>>> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>>>> +	void *addr = (void *)pfn_to_kaddr(pfn);
>>>> +	uint32_t len = nr_pages << PAGE_SHIFT;
>>>> +
>>>> +	send_free_page_sg(vb->free_page_vq, addr, len);
>>>> +}
>>>> +
>>>> +static void report_free_page_completion(struct virtio_balloon *vb)
>>>> +{
>>>> +	struct virtqueue *vq = vb->free_page_vq;
>>>> +	struct scatterlist sg;
>>>> +	unsigned int len;
>>>> +	int ret;
>>>> +
>>>> +	sg_init_one(&sg, &vb->report_free_page_signal, sizeof(__virtio32));
>>>> +retry:
>>>> +	ret = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> +	virtqueue_kick(vq);
>>>> +	if (unlikely(ret == -ENOSPC)) {
>>>> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		goto retry;
>>>> +	}
>>>> +}
>>> So the annoying thing here is that once this starts going,
>>> it will keep sending free pages from the list even if
>>> host is no longer interested. There should be a way
>>> for host to tell guest "stop" or "start from the beginning".
>> This can be achieved via two output signal buf here:
>> signal_buf_start: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_START
>> signal_buf_end: filled with VIRTIO_BALLOON_F_FREE_PAGE_REPORT_END
>>
>> The device holds both, and can put one of them to the vq and notify.
> Do you mean device writes start and end in the buf? then it's an inbuf
> not an outbuf.
>

Not really. I meant that the driver fills two signal buffer,_START and _STOP
and send them as outbuf to the device. Then the device holds two read-only
signal buffer:
When request to start: add the _START elem to the vq
When request  to stop: add the _STOP elem to the vq


>>
>>> It's the result of using same vq for guest to host and
>>> host to guest communication, and I think it's not a great idea.
>>> I'd reuse stats vq for host to guest requests maybe.
>>>
>>
>> As we discussed before, we can't have a vq interleave the report of stats
>> and free pages.
>> The vq will be locked when one command is in use. So, when live migration
>> starts, the
>> periodically reported stats will be delayed.
>
>
>
>
>
>> Would this be OK? Or would you
>> like to have
>> one host to guest vq, and multiple host to guest vqs? That is,
>>
>> - host to guest:
>> CMD_VQ
>>
>> - guest to host:
>> STATS_REPORT_VQ
>> FREE_PAGE_VQ
>>
>>
>> Best,
>> Wei
>>
> Point is stats report vq is also host to guest.
> So I think it can be combined with CMD VQ.
> If it's too hard a separate vq isn't too bad though.
>

IMHO, this kind of categorization - using stat_vq for

host-to-guest stats request,
guest-to-host stats report,
host-to-guest free page request,

seems a little tricky and unclear. I would think it is better to have 
unrelated
features decoupled. For example, both stats report and free page report
are optional features. With the above implementation, using the free page
feature will depend on the stats report feature (in fact we can implement
it to have the free page feature work independently)

That being said, if you don't mind the above, we can go with that option
in the next version.

Best,
Wei

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-18 13:46     ` Michal Hocko
  (?)
  (?)
@ 2017-08-21  6:12       ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  6:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 09:46 PM, Michal Hocko wrote:
> On Thu 17-08-17 11:26:55, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
> This could see more details to be honest. Especially the usecase you are
> going to use this for. This will help us to understand the motivation
> in future when the current user might be gone a new ones largely diverge
> into a different usage. This wouldn't be the first time I have seen
> something like that.

OK, I will more details here about how it's used to accelerate live 
migration.

>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>   include/linux/mm.h |  6 ++++++
>>   mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..cd29b9f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +extern void walk_free_mem_block(void *opaque1,
>> +				unsigned int min_order,
>> +				void (*visit)(void *opaque2,
>> +					      unsigned long pfn,
>> +					      unsigned long nr_pages));
>> +
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..a721a35 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque1: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @visit: the callback function given by the caller
> The original suggestion for using visit was motivated by a visit design
> pattern but I can see how this can be confusing. Maybe a more explicit
> name wold be better. What about report_free_range.


I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?


>
>> + *
>> + * The function is used to walk through the free page blocks in the system,
>> + * and each free page block is reported to the caller via the @visit callback.
>> + * Please note:
>> + * 1) The function is used to report hints of free pages, so the caller should
>> + * not use those reported pages after the callback returns.
>> + * 2) The callback is invoked with the zone->lock being held, so it should not
>> + * block and should finish as soon as possible.
> I think that the explicit note about zone->lock is not really need. This
> can change in future and I would even bet that somebody might rely on
> the lock being held for some purpose and silently get broken with the
> change. Instead I would much rather see something like the following:
> "
> Please note that there are no locking guarantees for the callback

Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?

> and
> that the reported pfn range might be freed or disappear after the
> callback returns so the caller has to be very careful how it is used.
>
> The callback itself must not sleep or perform any operations which would
> require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> or via any lock dependency. It is generally advisable to implement
> the callback as simple as possible and defer any heavy lifting to a
> different context.
>
> There is no guarantee that each free range will be reported only once
> during one walk_free_mem_block invocation.
>
> pfn_to_page on the given range is strongly discouraged and if there is
> an absolute need for that make sure to contact MM people to discuss
> potential problems.
>
> The function itself might sleep so it cannot be called from atomic
> contexts.
>
> In general low orders tend to be very volatile and so it makes more
> sense to query larger ones for various optimizations which like
> ballooning etc... This will reduce the overhead as well.
> "

I think it looks quite comprehensive. Thanks.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:12       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  6:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 09:46 PM, Michal Hocko wrote:
> On Thu 17-08-17 11:26:55, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
> This could see more details to be honest. Especially the usecase you are
> going to use this for. This will help us to understand the motivation
> in future when the current user might be gone a new ones largely diverge
> into a different usage. This wouldn't be the first time I have seen
> something like that.

OK, I will more details here about how it's used to accelerate live 
migration.

>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>   include/linux/mm.h |  6 ++++++
>>   mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..cd29b9f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +extern void walk_free_mem_block(void *opaque1,
>> +				unsigned int min_order,
>> +				void (*visit)(void *opaque2,
>> +					      unsigned long pfn,
>> +					      unsigned long nr_pages));
>> +
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..a721a35 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque1: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @visit: the callback function given by the caller
> The original suggestion for using visit was motivated by a visit design
> pattern but I can see how this can be confusing. Maybe a more explicit
> name wold be better. What about report_free_range.


I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?


>
>> + *
>> + * The function is used to walk through the free page blocks in the system,
>> + * and each free page block is reported to the caller via the @visit callback.
>> + * Please note:
>> + * 1) The function is used to report hints of free pages, so the caller should
>> + * not use those reported pages after the callback returns.
>> + * 2) The callback is invoked with the zone->lock being held, so it should not
>> + * block and should finish as soon as possible.
> I think that the explicit note about zone->lock is not really need. This
> can change in future and I would even bet that somebody might rely on
> the lock being held for some purpose and silently get broken with the
> change. Instead I would much rather see something like the following:
> "
> Please note that there are no locking guarantees for the callback

Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?

> and
> that the reported pfn range might be freed or disappear after the
> callback returns so the caller has to be very careful how it is used.
>
> The callback itself must not sleep or perform any operations which would
> require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> or via any lock dependency. It is generally advisable to implement
> the callback as simple as possible and defer any heavy lifting to a
> different context.
>
> There is no guarantee that each free range will be reported only once
> during one walk_free_mem_block invocation.
>
> pfn_to_page on the given range is strongly discouraged and if there is
> an absolute need for that make sure to contact MM people to discuss
> potential problems.
>
> The function itself might sleep so it cannot be called from atomic
> contexts.
>
> In general low orders tend to be very volatile and so it makes more
> sense to query larger ones for various optimizations which like
> ballooning etc... This will reduce the overhead as well.
> "

I think it looks quite comprehensive. Thanks.


Best,
Wei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:12       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  6:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 09:46 PM, Michal Hocko wrote:
> On Thu 17-08-17 11:26:55, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
> This could see more details to be honest. Especially the usecase you are
> going to use this for. This will help us to understand the motivation
> in future when the current user might be gone a new ones largely diverge
> into a different usage. This wouldn't be the first time I have seen
> something like that.

OK, I will more details here about how it's used to accelerate live 
migration.

>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>   include/linux/mm.h |  6 ++++++
>>   mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..cd29b9f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +extern void walk_free_mem_block(void *opaque1,
>> +				unsigned int min_order,
>> +				void (*visit)(void *opaque2,
>> +					      unsigned long pfn,
>> +					      unsigned long nr_pages));
>> +
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..a721a35 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque1: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @visit: the callback function given by the caller
> The original suggestion for using visit was motivated by a visit design
> pattern but I can see how this can be confusing. Maybe a more explicit
> name wold be better. What about report_free_range.


I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?


>
>> + *
>> + * The function is used to walk through the free page blocks in the system,
>> + * and each free page block is reported to the caller via the @visit callback.
>> + * Please note:
>> + * 1) The function is used to report hints of free pages, so the caller should
>> + * not use those reported pages after the callback returns.
>> + * 2) The callback is invoked with the zone->lock being held, so it should not
>> + * block and should finish as soon as possible.
> I think that the explicit note about zone->lock is not really need. This
> can change in future and I would even bet that somebody might rely on
> the lock being held for some purpose and silently get broken with the
> change. Instead I would much rather see something like the following:
> "
> Please note that there are no locking guarantees for the callback

Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?

> and
> that the reported pfn range might be freed or disappear after the
> callback returns so the caller has to be very careful how it is used.
>
> The callback itself must not sleep or perform any operations which would
> require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> or via any lock dependency. It is generally advisable to implement
> the callback as simple as possible and defer any heavy lifting to a
> different context.
>
> There is no guarantee that each free range will be reported only once
> during one walk_free_mem_block invocation.
>
> pfn_to_page on the given range is strongly discouraged and if there is
> an absolute need for that make sure to contact MM people to discuss
> potential problems.
>
> The function itself might sleep so it cannot be called from atomic
> contexts.
>
> In general low orders tend to be very volatile and so it makes more
> sense to query larger ones for various optimizations which like
> ballooning etc... This will reduce the overhead as well.
> "

I think it looks quite comprehensive. Thanks.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-18 13:46     ` Michal Hocko
  (?)
  (?)
@ 2017-08-21  6:12     ` Wei Wang
  -1 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  6:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: aarcange, virtio-dev, kvm, mst, qemu-devel, amit.shah,
	liliang.opensource, mawilcox, linux-kernel, willy,
	virtualization, linux-mm, yang.zhang.wz, quan.xu, cornelia.huck,
	pbonzini, akpm, mgorman

On 08/18/2017 09:46 PM, Michal Hocko wrote:
> On Thu 17-08-17 11:26:55, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
> This could see more details to be honest. Especially the usecase you are
> going to use this for. This will help us to understand the motivation
> in future when the current user might be gone a new ones largely diverge
> into a different usage. This wouldn't be the first time I have seen
> something like that.

OK, I will more details here about how it's used to accelerate live 
migration.

>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>   include/linux/mm.h |  6 ++++++
>>   mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..cd29b9f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +extern void walk_free_mem_block(void *opaque1,
>> +				unsigned int min_order,
>> +				void (*visit)(void *opaque2,
>> +					      unsigned long pfn,
>> +					      unsigned long nr_pages));
>> +
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..a721a35 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque1: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @visit: the callback function given by the caller
> The original suggestion for using visit was motivated by a visit design
> pattern but I can see how this can be confusing. Maybe a more explicit
> name wold be better. What about report_free_range.


I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?


>
>> + *
>> + * The function is used to walk through the free page blocks in the system,
>> + * and each free page block is reported to the caller via the @visit callback.
>> + * Please note:
>> + * 1) The function is used to report hints of free pages, so the caller should
>> + * not use those reported pages after the callback returns.
>> + * 2) The callback is invoked with the zone->lock being held, so it should not
>> + * block and should finish as soon as possible.
> I think that the explicit note about zone->lock is not really need. This
> can change in future and I would even bet that somebody might rely on
> the lock being held for some purpose and silently get broken with the
> change. Instead I would much rather see something like the following:
> "
> Please note that there are no locking guarantees for the callback

Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?

> and
> that the reported pfn range might be freed or disappear after the
> callback returns so the caller has to be very careful how it is used.
>
> The callback itself must not sleep or perform any operations which would
> require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> or via any lock dependency. It is generally advisable to implement
> the callback as simple as possible and defer any heavy lifting to a
> different context.
>
> There is no guarantee that each free range will be reported only once
> during one walk_free_mem_block invocation.
>
> pfn_to_page on the given range is strongly discouraged and if there is
> an absolute need for that make sure to contact MM people to discuss
> potential problems.
>
> The function itself might sleep so it cannot be called from atomic
> contexts.
>
> In general low orders tend to be very volatile and so it makes more
> sense to query larger ones for various optimizations which like
> ballooning etc... This will reduce the overhead as well.
> "

I think it looks quite comprehensive. Thanks.


Best,
Wei

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:12       ` Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-21  6:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 08/18/2017 09:46 PM, Michal Hocko wrote:
> On Thu 17-08-17 11:26:55, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
> This could see more details to be honest. Especially the usecase you are
> going to use this for. This will help us to understand the motivation
> in future when the current user might be gone a new ones largely diverge
> into a different usage. This wouldn't be the first time I have seen
> something like that.

OK, I will more details here about how it's used to accelerate live 
migration.

>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>   include/linux/mm.h |  6 ++++++
>>   mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..cd29b9f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +extern void walk_free_mem_block(void *opaque1,
>> +				unsigned int min_order,
>> +				void (*visit)(void *opaque2,
>> +					      unsigned long pfn,
>> +					      unsigned long nr_pages));
>> +
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..a721a35 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque1: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @visit: the callback function given by the caller
> The original suggestion for using visit was motivated by a visit design
> pattern but I can see how this can be confusing. Maybe a more explicit
> name wold be better. What about report_free_range.


I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?


>
>> + *
>> + * The function is used to walk through the free page blocks in the system,
>> + * and each free page block is reported to the caller via the @visit callback.
>> + * Please note:
>> + * 1) The function is used to report hints of free pages, so the caller should
>> + * not use those reported pages after the callback returns.
>> + * 2) The callback is invoked with the zone->lock being held, so it should not
>> + * block and should finish as soon as possible.
> I think that the explicit note about zone->lock is not really need. This
> can change in future and I would even bet that somebody might rely on
> the lock being held for some purpose and silently get broken with the
> change. Instead I would much rather see something like the following:
> "
> Please note that there are no locking guarantees for the callback

Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?

> and
> that the reported pfn range might be freed or disappear after the
> callback returns so the caller has to be very careful how it is used.
>
> The callback itself must not sleep or perform any operations which would
> require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> or via any lock dependency. It is generally advisable to implement
> the callback as simple as possible and defer any heavy lifting to a
> different context.
>
> There is no guarantee that each free range will be reported only once
> during one walk_free_mem_block invocation.
>
> pfn_to_page on the given range is strongly discouraged and if there is
> an absolute need for that make sure to contact MM people to discuss
> potential problems.
>
> The function itself might sleep so it cannot be called from atomic
> contexts.
>
> In general low orders tend to be very volatile and so it makes more
> sense to query larger ones for various optimizations which like
> ballooning etc... This will reduce the overhead as well.
> "

I think it looks quite comprehensive. Thanks.


Best,
Wei

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-21  6:12       ` Wei Wang
  (?)
  (?)
@ 2017-08-21  6:14         ` Michal Hocko
  -1 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:14 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon 21-08-17 14:12:47, Wei Wang wrote:
> On 08/18/2017 09:46 PM, Michal Hocko wrote:
[...]
> >>+/**
> >>+ * walk_free_mem_block - Walk through the free page blocks in the system
> >>+ * @opaque1: the context passed from the caller
> >>+ * @min_order: the minimum order of free lists to check
> >>+ * @visit: the callback function given by the caller
> >The original suggestion for using visit was motivated by a visit design
> >pattern but I can see how this can be confusing. Maybe a more explicit
> >name wold be better. What about report_free_range.
> 
> 
> I'm afraid that name would be too long to fit in nicely.
> How about simply naming it "report"?

I do not have a strong opinion on this. I wouldn't be afraid of using
slightly longer name here for the clarity sake, though.
 
> >>+ *
> >>+ * The function is used to walk through the free page blocks in the system,
> >>+ * and each free page block is reported to the caller via the @visit callback.
> >>+ * Please note:
> >>+ * 1) The function is used to report hints of free pages, so the caller should
> >>+ * not use those reported pages after the callback returns.
> >>+ * 2) The callback is invoked with the zone->lock being held, so it should not
> >>+ * block and should finish as soon as possible.
> >I think that the explicit note about zone->lock is not really need. This
> >can change in future and I would even bet that somebody might rely on
> >the lock being held for some purpose and silently get broken with the
> >change. Instead I would much rather see something like the following:
> >"
> >Please note that there are no locking guarantees for the callback
> 
> Just a little confused with this one:
> 
> The callback is invoked within zone->lock, why would we claim it "no
> locking guarantees for the callback"?

Because we definitely do not want anybody to rely on that fact and
(ab)use it. This might change in future and it would be better to be
clear about that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:14         ` Michal Hocko
  0 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:14 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mst, qemu-devel, amit.shah,
	liliang.opensource, mawilcox, linux-kernel, willy,
	virtualization, linux-mm, yang.zhang.wz, quan.xu, cornelia.huck,
	pbonzini, akpm, mgorman

On Mon 21-08-17 14:12:47, Wei Wang wrote:
> On 08/18/2017 09:46 PM, Michal Hocko wrote:
[...]
> >>+/**
> >>+ * walk_free_mem_block - Walk through the free page blocks in the system
> >>+ * @opaque1: the context passed from the caller
> >>+ * @min_order: the minimum order of free lists to check
> >>+ * @visit: the callback function given by the caller
> >The original suggestion for using visit was motivated by a visit design
> >pattern but I can see how this can be confusing. Maybe a more explicit
> >name wold be better. What about report_free_range.
> 
> 
> I'm afraid that name would be too long to fit in nicely.
> How about simply naming it "report"?

I do not have a strong opinion on this. I wouldn't be afraid of using
slightly longer name here for the clarity sake, though.
 
> >>+ *
> >>+ * The function is used to walk through the free page blocks in the system,
> >>+ * and each free page block is reported to the caller via the @visit callback.
> >>+ * Please note:
> >>+ * 1) The function is used to report hints of free pages, so the caller should
> >>+ * not use those reported pages after the callback returns.
> >>+ * 2) The callback is invoked with the zone->lock being held, so it should not
> >>+ * block and should finish as soon as possible.
> >I think that the explicit note about zone->lock is not really need. This
> >can change in future and I would even bet that somebody might rely on
> >the lock being held for some purpose and silently get broken with the
> >change. Instead I would much rather see something like the following:
> >"
> >Please note that there are no locking guarantees for the callback
> 
> Just a little confused with this one:
> 
> The callback is invoked within zone->lock, why would we claim it "no
> locking guarantees for the callback"?

Because we definitely do not want anybody to rely on that fact and
(ab)use it. This might change in future and it would be better to be
clear about that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:14         ` Michal Hocko
  0 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:14 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon 21-08-17 14:12:47, Wei Wang wrote:
> On 08/18/2017 09:46 PM, Michal Hocko wrote:
[...]
> >>+/**
> >>+ * walk_free_mem_block - Walk through the free page blocks in the system
> >>+ * @opaque1: the context passed from the caller
> >>+ * @min_order: the minimum order of free lists to check
> >>+ * @visit: the callback function given by the caller
> >The original suggestion for using visit was motivated by a visit design
> >pattern but I can see how this can be confusing. Maybe a more explicit
> >name wold be better. What about report_free_range.
> 
> 
> I'm afraid that name would be too long to fit in nicely.
> How about simply naming it "report"?

I do not have a strong opinion on this. I wouldn't be afraid of using
slightly longer name here for the clarity sake, though.
 
> >>+ *
> >>+ * The function is used to walk through the free page blocks in the system,
> >>+ * and each free page block is reported to the caller via the @visit callback.
> >>+ * Please note:
> >>+ * 1) The function is used to report hints of free pages, so the caller should
> >>+ * not use those reported pages after the callback returns.
> >>+ * 2) The callback is invoked with the zone->lock being held, so it should not
> >>+ * block and should finish as soon as possible.
> >I think that the explicit note about zone->lock is not really need. This
> >can change in future and I would even bet that somebody might rely on
> >the lock being held for some purpose and silently get broken with the
> >change. Instead I would much rather see something like the following:
> >"
> >Please note that there are no locking guarantees for the callback
> 
> Just a little confused with this one:
> 
> The callback is invoked within zone->lock, why would we claim it "no
> locking guarantees for the callback"?

Because we definitely do not want anybody to rely on that fact and
(ab)use it. This might change in future and it would be better to be
clear about that.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:14         ` Michal Hocko
  0 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:14 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon 21-08-17 14:12:47, Wei Wang wrote:
> On 08/18/2017 09:46 PM, Michal Hocko wrote:
[...]
> >>+/**
> >>+ * walk_free_mem_block - Walk through the free page blocks in the system
> >>+ * @opaque1: the context passed from the caller
> >>+ * @min_order: the minimum order of free lists to check
> >>+ * @visit: the callback function given by the caller
> >The original suggestion for using visit was motivated by a visit design
> >pattern but I can see how this can be confusing. Maybe a more explicit
> >name wold be better. What about report_free_range.
> 
> 
> I'm afraid that name would be too long to fit in nicely.
> How about simply naming it "report"?

I do not have a strong opinion on this. I wouldn't be afraid of using
slightly longer name here for the clarity sake, though.
 
> >>+ *
> >>+ * The function is used to walk through the free page blocks in the system,
> >>+ * and each free page block is reported to the caller via the @visit callback.
> >>+ * Please note:
> >>+ * 1) The function is used to report hints of free pages, so the caller should
> >>+ * not use those reported pages after the callback returns.
> >>+ * 2) The callback is invoked with the zone->lock being held, so it should not
> >>+ * block and should finish as soon as possible.
> >I think that the explicit note about zone->lock is not really need. This
> >can change in future and I would even bet that somebody might rely on
> >the lock being held for some purpose and silently get broken with the
> >change. Instead I would much rather see something like the following:
> >"
> >Please note that there are no locking guarantees for the callback
> 
> Just a little confused with this one:
> 
> The callback is invoked within zone->lock, why would we claim it "no
> locking guarantees for the callback"?

Because we definitely do not want anybody to rely on that fact and
(ab)use it. This might change in future and it would be better to be
clear about that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-18 17:23     ` Michael S. Tsirkin
  (?)
@ 2017-08-21  6:18       ` Michal Hocko
  -1 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri 18-08-17 20:23:05, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
[...]
> > +void walk_free_mem_block(void *opaque1,
> > +			 unsigned int min_order,
> > +			 void (*visit)(void *opaque2,
> 
> You can just avoid opaque2 completely I think, then opaque1 can
> be renamed opaque.
> 
> > +				       unsigned long pfn,
> > +				       unsigned long nr_pages))
> > +{
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned int order;
> > +	enum migratetype mt;
> > +	unsigned long pfn, flags;
> > +
> > +	for_each_populated_zone(zone) {
> > +		for (order = MAX_ORDER - 1;
> > +		     order < MAX_ORDER && order >= min_order; order--) {
> > +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +				spin_lock_irqsave(&zone->lock, flags);
> > +				list = &zone->free_area[order].free_list[mt];
> > +				list_for_each_entry(page, list, lru) {
> > +					pfn = page_to_pfn(page);
> > +					visit(opaque1, pfn, 1 << order);
> 
> My only concern here is inability of callback to
> 1. break out of list
> 2. remove page from the list

As I've said before this has to be a read only API. You cannot simply
fiddle with the page allocator internals under its feet.

> So I would make the callback bool, and I would use
> list_for_each_entry_safe.

If a bool would tell to break out of the loop then I agree. This sounds
useful.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:18       ` Michal Hocko
  0 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri 18-08-17 20:23:05, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
[...]
> > +void walk_free_mem_block(void *opaque1,
> > +			 unsigned int min_order,
> > +			 void (*visit)(void *opaque2,
> 
> You can just avoid opaque2 completely I think, then opaque1 can
> be renamed opaque.
> 
> > +				       unsigned long pfn,
> > +				       unsigned long nr_pages))
> > +{
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned int order;
> > +	enum migratetype mt;
> > +	unsigned long pfn, flags;
> > +
> > +	for_each_populated_zone(zone) {
> > +		for (order = MAX_ORDER - 1;
> > +		     order < MAX_ORDER && order >= min_order; order--) {
> > +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +				spin_lock_irqsave(&zone->lock, flags);
> > +				list = &zone->free_area[order].free_list[mt];
> > +				list_for_each_entry(page, list, lru) {
> > +					pfn = page_to_pfn(page);
> > +					visit(opaque1, pfn, 1 << order);
> 
> My only concern here is inability of callback to
> 1. break out of list
> 2. remove page from the list

As I've said before this has to be a read only API. You cannot simply
fiddle with the page allocator internals under its feet.

> So I would make the callback bool, and I would use
> list_for_each_entry_safe.

If a bool would tell to break out of the loop then I agree. This sounds
useful.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 4/5] mm: support reporting free page blocks
@ 2017-08-21  6:18       ` Michal Hocko
  0 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri 18-08-17 20:23:05, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
[...]
> > +void walk_free_mem_block(void *opaque1,
> > +			 unsigned int min_order,
> > +			 void (*visit)(void *opaque2,
> 
> You can just avoid opaque2 completely I think, then opaque1 can
> be renamed opaque.
> 
> > +				       unsigned long pfn,
> > +				       unsigned long nr_pages))
> > +{
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned int order;
> > +	enum migratetype mt;
> > +	unsigned long pfn, flags;
> > +
> > +	for_each_populated_zone(zone) {
> > +		for (order = MAX_ORDER - 1;
> > +		     order < MAX_ORDER && order >= min_order; order--) {
> > +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +				spin_lock_irqsave(&zone->lock, flags);
> > +				list = &zone->free_area[order].free_list[mt];
> > +				list_for_each_entry(page, list, lru) {
> > +					pfn = page_to_pfn(page);
> > +					visit(opaque1, pfn, 1 << order);
> 
> My only concern here is inability of callback to
> 1. break out of list
> 2. remove page from the list

As I've said before this has to be a read only API. You cannot simply
fiddle with the page allocator internals under its feet.

> So I would make the callback bool, and I would use
> list_for_each_entry_safe.

If a bool would tell to break out of the loop then I agree. This sounds
useful.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 4/5] mm: support reporting free page blocks
  2017-08-18 17:23     ` Michael S. Tsirkin
                       ` (3 preceding siblings ...)
  (?)
@ 2017-08-21  6:18     ` Michal Hocko
  -1 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2017-08-21  6:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mgorman

On Fri 18-08-17 20:23:05, Michael S. Tsirkin wrote:
> On Thu, Aug 17, 2017 at 11:26:55AM +0800, Wei Wang wrote:
[...]
> > +void walk_free_mem_block(void *opaque1,
> > +			 unsigned int min_order,
> > +			 void (*visit)(void *opaque2,
> 
> You can just avoid opaque2 completely I think, then opaque1 can
> be renamed opaque.
> 
> > +				       unsigned long pfn,
> > +				       unsigned long nr_pages))
> > +{
> > +	struct zone *zone;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned int order;
> > +	enum migratetype mt;
> > +	unsigned long pfn, flags;
> > +
> > +	for_each_populated_zone(zone) {
> > +		for (order = MAX_ORDER - 1;
> > +		     order < MAX_ORDER && order >= min_order; order--) {
> > +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +				spin_lock_irqsave(&zone->lock, flags);
> > +				list = &zone->free_area[order].free_list[mt];
> > +				list_for_each_entry(page, list, lru) {
> > +					pfn = page_to_pfn(page);
> > +					visit(opaque1, pfn, 1 << order);
> 
> My only concern here is inability of callback to
> 1. break out of list
> 2. remove page from the list

As I've said before this has to be a read only API. You cannot simply
fiddle with the page allocator internals under its feet.

> So I would make the callback bool, and I would use
> list_for_each_entry_safe.

If a bool would tell to break out of the loop then I agree. This sounds
useful.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-18  7:39       ` Wei Wang
  (?)
  (?)
@ 2017-08-21 20:22         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-21 20:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 03:39:27PM +0800, Wei Wang wrote:
> On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> > +static void send_balloon_page_sg(struct virtio_balloon *vb,
> > +				 struct virtqueue *vq,
> > +				 void *addr,
> > +				 uint32_t size)
> > +{
> > +	unsigned int len;
> > +	int ret;
> > +
> > +	do {
> > +		ret = add_one_sg(vq, addr, size);
> > +		virtqueue_kick(vq);
> > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +		/*
> > +		 * It is uncommon to see the vq is full, because the sg is sent
> > +		 * one by one and the device is able to handle it in time. But
> > +		 * if that happens, we go back to retry after an entry gets
> > +		 * released.
> > +		 */
> > Why send one by one though? Why not batch some s/gs and wait for all
> > of them to be completed? If memory if fragmented, waiting every time is
> > worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
> > 
> 
> OK, I'll do batching in some fashion.
> 
> 
> Best,
> Wei
> 
> 

btw you need to address the build errors that kbot has found.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-21 20:22         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-21 20:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 03:39:27PM +0800, Wei Wang wrote:
> On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> > +static void send_balloon_page_sg(struct virtio_balloon *vb,
> > +				 struct virtqueue *vq,
> > +				 void *addr,
> > +				 uint32_t size)
> > +{
> > +	unsigned int len;
> > +	int ret;
> > +
> > +	do {
> > +		ret = add_one_sg(vq, addr, size);
> > +		virtqueue_kick(vq);
> > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +		/*
> > +		 * It is uncommon to see the vq is full, because the sg is sent
> > +		 * one by one and the device is able to handle it in time. But
> > +		 * if that happens, we go back to retry after an entry gets
> > +		 * released.
> > +		 */
> > Why send one by one though? Why not batch some s/gs and wait for all
> > of them to be completed? If memory if fragmented, waiting every time is
> > worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
> > 
> 
> OK, I'll do batching in some fashion.
> 
> 
> Best,
> Wei
> 
> 

btw you need to address the build errors that kbot has found.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Qemu-devel] [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-21 20:22         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-21 20:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 03:39:27PM +0800, Wei Wang wrote:
> On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> > +static void send_balloon_page_sg(struct virtio_balloon *vb,
> > +				 struct virtqueue *vq,
> > +				 void *addr,
> > +				 uint32_t size)
> > +{
> > +	unsigned int len;
> > +	int ret;
> > +
> > +	do {
> > +		ret = add_one_sg(vq, addr, size);
> > +		virtqueue_kick(vq);
> > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +		/*
> > +		 * It is uncommon to see the vq is full, because the sg is sent
> > +		 * one by one and the device is able to handle it in time. But
> > +		 * if that happens, we go back to retry after an entry gets
> > +		 * released.
> > +		 */
> > Why send one by one though? Why not batch some s/gs and wait for all
> > of them to be completed? If memory if fragmented, waiting every time is
> > worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
> > 
> 
> OK, I'll do batching in some fashion.
> 
> 
> Best,
> Wei
> 
> 

btw you need to address the build errors that kbot has found.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-08-18  7:39       ` Wei Wang
                         ` (3 preceding siblings ...)
  (?)
@ 2017-08-21 20:22       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-21 20:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

On Fri, Aug 18, 2017 at 03:39:27PM +0800, Wei Wang wrote:
> On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> > +static void send_balloon_page_sg(struct virtio_balloon *vb,
> > +				 struct virtqueue *vq,
> > +				 void *addr,
> > +				 uint32_t size)
> > +{
> > +	unsigned int len;
> > +	int ret;
> > +
> > +	do {
> > +		ret = add_one_sg(vq, addr, size);
> > +		virtqueue_kick(vq);
> > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +		/*
> > +		 * It is uncommon to see the vq is full, because the sg is sent
> > +		 * one by one and the device is able to handle it in time. But
> > +		 * if that happens, we go back to retry after an entry gets
> > +		 * released.
> > +		 */
> > Why send one by one though? Why not batch some s/gs and wait for all
> > of them to be completed? If memory if fragmented, waiting every time is
> > worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
> > 
> 
> OK, I'll do batching in some fashion.
> 
> 
> Best,
> Wei
> 
> 

btw you need to address the build errors that kbot has found.

-- 
MST

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [virtio-dev] Re: [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
@ 2017-08-21 20:22         ` Michael S. Tsirkin
  0 siblings, 0 replies; 116+ messages in thread
From: Michael S. Tsirkin @ 2017-08-21 20:22 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Fri, Aug 18, 2017 at 03:39:27PM +0800, Wei Wang wrote:
> On 08/18/2017 10:22 AM, Michael S. Tsirkin wrote:
> > +static void send_balloon_page_sg(struct virtio_balloon *vb,
> > +				 struct virtqueue *vq,
> > +				 void *addr,
> > +				 uint32_t size)
> > +{
> > +	unsigned int len;
> > +	int ret;
> > +
> > +	do {
> > +		ret = add_one_sg(vq, addr, size);
> > +		virtqueue_kick(vq);
> > +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +		/*
> > +		 * It is uncommon to see the vq is full, because the sg is sent
> > +		 * one by one and the device is able to handle it in time. But
> > +		 * if that happens, we go back to retry after an entry gets
> > +		 * released.
> > +		 */
> > Why send one by one though? Why not batch some s/gs and wait for all
> > of them to be completed? If memory if fragmented, waiting every time is
> > worse than what we have now (VIRTIO_BALLOON_ARRAY_PFNS_MAX at a time).
> > 
> 
> OK, I'll do batching in some fashion.
> 
> 
> Best,
> Wei
> 
> 

btw you need to address the build errors that kbot has found.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v14 0/5] Virtio-balloon Enhancement
@ 2017-08-17  3:26 Wei Wang
  0 siblings, 0 replies; 116+ messages in thread
From: Wei Wang @ 2017-08-17  3:26 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one by one; and
2) free_page_vq: a new virtqueue to report guest free pages to the host.

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

The second feature  enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It is not concerned that the memory pages are used after they
are given to the hypervisor as a hint of the free pages, because they will
be tracked by the hypervisor and transferred in the next round if they are
used and written.

Change Log:
v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page blcok to the device, if the
vq is full (less likey to happen in practice), just skip reporting this
block, instead of busywaiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the the free page blocks from the
report function. This avoids exposing the zone internal to a kernel module.
2) virtio-balloon: send balloon pages or a free page block using a single sg
each time. This has the benefits of simpler implementation with no new APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input sgs,
and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE() macro
to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages blocks
directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3)  virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Matthew Wilcox (1):
  lib/xbitmap: Introduce xbitmap

Wei Wang (4):
  lib/xbitmap: add xb_find_next_bit() and xb_zero()
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

 drivers/virtio/virtio_balloon.c     | 324 +++++++++++++++++++++++++++++++-----
 include/linux/mm.h                  |   6 +
 include/linux/radix-tree.h          |   3 +
 include/linux/xbitmap.h             |  64 +++++++
 include/uapi/linux/virtio_balloon.h |   2 +
 lib/Makefile                        |   2 +-
 lib/radix-tree.c                    |  22 ++-
 lib/xbitmap.c                       | 215 ++++++++++++++++++++++++
 mm/page_alloc.c                     |  44 +++++
 9 files changed, 640 insertions(+), 42 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2017-08-21 20:32 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-17  3:26 [PATCH v14 0/5] Virtio-balloon Enhancement Wei Wang
2017-08-17  3:26 ` [virtio-dev] " Wei Wang
2017-08-17  3:26 ` [Qemu-devel] " Wei Wang
2017-08-17  3:26 ` Wei Wang
2017-08-17  3:26 ` [PATCH v14 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
2017-08-17  3:26 ` Wei Wang
2017-08-17  3:26   ` [virtio-dev] " Wei Wang
2017-08-17  3:26   ` [Qemu-devel] " Wei Wang
2017-08-17  3:26   ` Wei Wang
2017-08-19 20:30   ` kbuild test robot
2017-08-19 20:30     ` [Qemu-devel] " kbuild test robot
2017-08-19 20:30     ` kbuild test robot
2017-08-19 20:30   ` kbuild test robot
2017-08-17  3:26 ` [PATCH v14 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero() Wei Wang
2017-08-17  3:26 ` Wei Wang
2017-08-17  3:26   ` [virtio-dev] " Wei Wang
2017-08-17  3:26   ` [Qemu-devel] " Wei Wang
2017-08-17  3:26   ` Wei Wang
2017-08-17  3:26 ` [PATCH v14 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
2017-08-17  3:26 ` Wei Wang
2017-08-17  3:26   ` [virtio-dev] " Wei Wang
2017-08-17  3:26   ` [Qemu-devel] " Wei Wang
2017-08-17  3:26   ` Wei Wang
2017-08-18  2:22   ` Michael S. Tsirkin
2017-08-18  2:22     ` [virtio-dev] " Michael S. Tsirkin
2017-08-18  2:22     ` [Qemu-devel] " Michael S. Tsirkin
2017-08-18  2:22     ` Michael S. Tsirkin
2017-08-18  7:39     ` Wei Wang
2017-08-18  7:39     ` Wei Wang
2017-08-18  7:39       ` [virtio-dev] " Wei Wang
2017-08-18  7:39       ` [Qemu-devel] " Wei Wang
2017-08-18  7:39       ` Wei Wang
2017-08-21 20:22       ` Michael S. Tsirkin
2017-08-21 20:22         ` [virtio-dev] " Michael S. Tsirkin
2017-08-21 20:22         ` [Qemu-devel] " Michael S. Tsirkin
2017-08-21 20:22         ` Michael S. Tsirkin
2017-08-21 20:22       ` Michael S. Tsirkin
2017-08-18  2:22   ` Michael S. Tsirkin
2017-08-19 21:37   ` kbuild test robot
2017-08-19 21:37   ` kbuild test robot
2017-08-19 21:37     ` [Qemu-devel] " kbuild test robot
2017-08-19 21:37     ` kbuild test robot
2017-08-17  3:26 ` [PATCH v14 4/5] mm: support reporting free page blocks Wei Wang
2017-08-17  3:26   ` [virtio-dev] " Wei Wang
2017-08-17  3:26   ` [Qemu-devel] " Wei Wang
2017-08-17  3:26   ` Wei Wang
2017-08-18 13:46   ` Michal Hocko
2017-08-18 13:46     ` [Qemu-devel] " Michal Hocko
2017-08-18 13:46     ` Michal Hocko
2017-08-21  6:12     ` Wei Wang
2017-08-21  6:12     ` Wei Wang
2017-08-21  6:12       ` [virtio-dev] " Wei Wang
2017-08-21  6:12       ` [Qemu-devel] " Wei Wang
2017-08-21  6:12       ` Wei Wang
2017-08-21  6:14       ` Michal Hocko
2017-08-21  6:14         ` [Qemu-devel] " Michal Hocko
2017-08-21  6:14         ` Michal Hocko
2017-08-21  6:14         ` Michal Hocko
2017-08-18 13:46   ` Michal Hocko
2017-08-18 17:23   ` Michael S. Tsirkin
2017-08-18 17:23     ` [virtio-dev] " Michael S. Tsirkin
2017-08-18 17:23     ` [Qemu-devel] " Michael S. Tsirkin
2017-08-18 17:23     ` Michael S. Tsirkin
2017-08-21  6:18     ` Michal Hocko
2017-08-21  6:18       ` [Qemu-devel] " Michal Hocko
2017-08-21  6:18       ` Michal Hocko
2017-08-21  6:18     ` Michal Hocko
2017-08-18 17:23   ` Michael S. Tsirkin
2017-08-17  3:26 ` Wei Wang
2017-08-17  3:26 ` [PATCH v14 5/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ Wei Wang
2017-08-17  3:26 ` Wei Wang
2017-08-17  3:26   ` [virtio-dev] " Wei Wang
2017-08-17  3:26   ` [Qemu-devel] " Wei Wang
2017-08-17  3:26   ` Wei Wang
2017-08-18  2:13   ` Michael S. Tsirkin
2017-08-18  2:13   ` Michael S. Tsirkin
2017-08-18  2:13     ` [virtio-dev] " Michael S. Tsirkin
2017-08-18  2:13     ` [Qemu-devel] " Michael S. Tsirkin
2017-08-18  2:13     ` Michael S. Tsirkin
2017-08-18  8:41     ` Wei Wang
2017-08-18  8:41     ` Wei Wang
2017-08-18  8:41       ` [virtio-dev] " Wei Wang
2017-08-18  8:41       ` [Qemu-devel] " Wei Wang
2017-08-18  8:41       ` Wei Wang
2017-08-18 18:26       ` Michael S. Tsirkin
2017-08-18 18:26         ` [virtio-dev] " Michael S. Tsirkin
2017-08-18 18:26         ` [Qemu-devel] " Michael S. Tsirkin
2017-08-18 18:26         ` Michael S. Tsirkin
2017-08-21  5:21         ` Wei Wang
2017-08-21  5:21           ` [virtio-dev] " Wei Wang
2017-08-21  5:21           ` [Qemu-devel] " Wei Wang
2017-08-21  5:21           ` Wei Wang
2017-08-21  5:21         ` Wei Wang
2017-08-18 18:26       ` Michael S. Tsirkin
2017-08-18  2:28   ` Michael S. Tsirkin
2017-08-18  2:28     ` [virtio-dev] " Michael S. Tsirkin
2017-08-18  2:28     ` [Qemu-devel] " Michael S. Tsirkin
2017-08-18  2:28     ` Michael S. Tsirkin
2017-08-18  8:36     ` Wei Wang
2017-08-18  8:36     ` Wei Wang
2017-08-18  8:36       ` [virtio-dev] " Wei Wang
2017-08-18  8:36       ` [Qemu-devel] " Wei Wang
2017-08-18  8:36       ` Wei Wang
2017-08-18 18:10       ` Michael S. Tsirkin
2017-08-18 18:10         ` [virtio-dev] " Michael S. Tsirkin
2017-08-18 18:10         ` [Qemu-devel] " Michael S. Tsirkin
2017-08-18 18:10         ` Michael S. Tsirkin
2017-08-21  5:28         ` [virtio-dev] " Wei Wang
2017-08-21  5:28         ` Wei Wang
2017-08-21  5:28           ` Wei Wang
2017-08-21  5:28           ` [Qemu-devel] " Wei Wang
2017-08-21  5:28           ` Wei Wang
2017-08-21  5:28           ` Wei Wang
2017-08-18 18:10       ` Michael S. Tsirkin
2017-08-18  2:28   ` Michael S. Tsirkin
  -- strict thread matches above, loose matches on Subject: below --
2017-08-17  3:26 [PATCH v14 0/5] Virtio-balloon Enhancement Wei Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.