* [PATCH v16 0/5] Virtio-balloon Enhancement
From: Wei Wang @ 2017-09-30  4:05 UTC
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs (scatter-gather lists), instead of one pfn array each time
(see the rough sketch after this list); and
2) free page block reporting: a new virtqueue to report guest free pages
to the host.
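
As a rough illustration of 1), a physically contiguous run of ballooned
pages could be described by a single sg entry rather than one pfn array
entry per page. This is a hypothetical sketch only, not the code of the
VIRTIO_BALLOON_F_SG patch; first_page, nr_pages, vq and vb are placeholder
names:

    struct scatterlist sg;

    sg_init_one(&sg, page_address(first_page), nr_pages << PAGE_SHIFT);
    if (!virtqueue_add_inbuf(vq, &sg, 1, vb, GFP_KERNEL))
        virtqueue_kick(vq);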

The second feature can be used to accelerate live migration of VMs. Here
are some details:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
commonly used by the hypervisor to track which parts of memory are written
is to write-protect all the guest memory.

The second feature enables an optimization of the 1st round of memory
transfer: the hypervisor can skip sending the guest's free pages in the
1st round. It does not matter if those pages are reused by the guest after
they have been reported as free, because any subsequent writes are caught
by the hypervisor's dirty tracking and the pages are then transferred in a
later round.
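
Conceptually, the hypervisor-side logic for the 1st round could look like
the following. This is a hypothetical sketch only, not part of this series;
first_round, hinted_free(), test_and_clear_dirty() and send_page() are
placeholders for the hypervisor's own bookkeeping:

    for (pfn = 0; pfn < last_pfn; pfn++) {
        if (first_round && hinted_free(pfn))
            continue;   /* skip now; resent later if the guest dirties it */
        if (first_round || test_and_clear_dirty(pfn))
            send_page(pfn);
    }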

Change Log:
v15->v16:
1) mm: stop reporting the free pfn range if the callback returns false;
2) mm: move some implementation of walk_free_mem_block into a function to
make the code layout look better;
3) xbitmap: added some optimizations suggested by Matthew; please refer to
the ChangeLog in the xbitmap patch for details;
4) xbitmap: added a test suite;
5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf returns
an error;
6) virtio-balloon: some small code re-arrangement, e.g. detaching the used
buf from the vq before adding a new buf.

v14->v15:
1) mm: make the report callback return a bool value - returning true stops
the walk through the free page list.
2) virtio-balloon: batch sgs of balloon pages till the vq is full
3) virtio-balloon: create a new workqueue, rather than using the default
system_wq, to queue the free page reporting work item.
4) virtio-balloon: add a ctrl_vq to be a central control plane which will
handle all the future control-related commands between the host and guest.
Free page reporting is added as the first feature controlled via the
ctrl_vq, and the free_page_vq is a data-plane vq dedicated to the
transmission of free page blocks.

v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documentation for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page block to the device, if the
vq is full (less likely to happen in practice), just skip reporting this
block instead of busy-waiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.

v12->v13:
1) mm: use a callback function to handle the free page blocks from the
report function. This avoids exposing zone internals to a kernel
module.
2) virtio-balloon: send balloon pages or a free page block using a single
sg each time. This has the benefit of a simpler implementation with no new
APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
interleaving of multiple usages)
4) virtio-balloon: Balloon pages and free page blocks are sent via input
sgs, and the completion signal to the host is sent via an output sg.

v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring
desc.
3) virtio-ring: add locking to the existing START_USE() and END_USE()
macros to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages
blocks directly using one or more chains of desc from the vq.

v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support adding an indirect desc table to the virtqueue;
3) virtio_balloon: use cmdq to report guest memory statistics.

v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.

v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) Re-write the virtio-balloon implementation patch.
3) Commit log changes.
4) Patch re-organization.

Matthew Wilcox (2):
  lib/xbitmap: Introduce xbitmap
  radix tree test suite: add tests for xbitmap

Wei Wang (3):
  virtio-balloon: VIRTIO_BALLOON_F_SG
  mm: support reporting free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

 drivers/virtio/virtio_balloon.c         | 437 +++++++++++++++++++++++++++++---
 include/linux/mm.h                      |   6 +
 include/linux/radix-tree.h              |   2 +
 include/linux/xbitmap.h                 |  66 +++++
 include/uapi/linux/virtio_balloon.h     |  16 ++
 lib/Makefile                            |   2 +-
 lib/radix-tree.c                        |  42 ++-
 lib/xbitmap.c                           | 264 +++++++++++++++++++
 mm/page_alloc.c                         |  91 +++++++
 tools/include/linux/bitmap.h            |  34 +++
 tools/include/linux/kernel.h            |   2 +
 tools/testing/radix-tree/Makefile       |   7 +-
 tools/testing/radix-tree/linux/kernel.h |   2 -
 tools/testing/radix-tree/main.c         |   5 +
 tools/testing/radix-tree/test.h         |   1 +
 tools/testing/radix-tree/xbitmap.c      | 269 ++++++++++++++++++++
 16 files changed, 1203 insertions(+), 43 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c
 create mode 100644 tools/testing/radix-tree/xbitmap.c

-- 
2.7.4

* [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap
From: Wei Wang @ 2017-09-30  4:05 UTC
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
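
For illustration, a minimal usage sketch based only on the API declared in
this patch (not part of the patch; error handling is simplified):

    DEFINE_XB(xb);
    unsigned long next;

    xb_preload(GFP_KERNEL);
    if (xb_set_bit(&xb, 123456))
        pr_warn("xb_set_bit failed (-EAGAIN/-ENOMEM); caller may retry\n");
    xb_preload_end();

    WARN_ON(!xb_test_bit(&xb, 123456));

    /* returns 123456 here; returns end + 1 if no set bit is found */
    next = xb_find_next_set_bit(&xb, 0, 1 << 20);

    xb_clear_bit(&xb, 123456);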

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits;
2) when searching for a bit, if the bit is not found in the slot, move on
to the next slot directly;
3) add Tags to help searching.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) Clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly rather than
      searching the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first.
3) more efficient bit searching, like 2.A.
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  66 ++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  42 +++++++-
 lib/xbitmap.c              | 264 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 373 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..1cffeb3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..f634bd9
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,66 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..1e15e30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding item (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,29 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }
 
 /**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		preempt_disable();
+
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..4ab9ac2
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,264 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * This function is used to clear a range of bits in the xbitmap. If all the
+ * bits of a bitmap become 0, that bitmap will be freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit > BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add Tags to help searching

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) Clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly rather than
      search the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first.
3) more efficient bit searching, like 2.A.
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  66 ++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  42 +++++++-
 lib/xbitmap.c              | 264 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 373 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..1cffeb3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..f634bd9
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,66 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..1e15e30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits, and it is
+ * implemented based on ida bitmaps. So, given an unsigned long index,
+ * the high order XB_INDEX_BITS bits of the index is used to find the
+ * corresponding item (i.e. ida bitmap) from the radix tree, and the low
+ * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
+ * the ida bitmap to find the bit.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,29 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }
 
 /**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		preempt_disable();
+
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..4ab9ac2
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,264 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit - clear a range of bits in the xbitmap
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: 1 if the bit is set, or 0 otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit > BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
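
As a rough illustration of how the two search helpers above are meant to be
used, a caller could walk every set bit in a range along these lines. This is
only a sketch against the API declared in this patch; the wrapper name
xb_for_each_set_bit and its callback are hypothetical and not part of the
series:

#include <linux/xbitmap.h>

/*
 * Sketch: visit every set bit in [start, end] of an xbitmap and hand it
 * to a caller-supplied callback. The xbitmap is assumed to have been
 * initialised with xb_init() and populated with xb_set_bit().
 */
static void xb_for_each_set_bit(struct xb *xb, unsigned long start,
				unsigned long end,
				void (*fn)(unsigned long bit))
{
	unsigned long bit = start;

	/* xb_find_next_set_bit() returns end + 1 when nothing is found */
	while ((bit = xb_find_next_set_bit(xb, bit, end)) <= end) {
		fn(bit);
		if (bit == end)
			break;
		bit++;
	}
}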
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
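
To give a feel for the intended calling convention, a minimal usage sketch
could look as follows (illustrative only; the xbitmap instance, the function
names and the pfn indices are hypothetical, not part of this patch):

#include <linux/xbitmap.h>

static DEFINE_XB(marked_pfns);	/* hypothetical xbitmap instance */

static int mark_pfn(unsigned long pfn)
{
	int ret;

	xb_preload(GFP_KERNEL);			/* preallocate; disables preemption */
	ret = xb_set_bit(&marked_pfns, pfn);	/* 0, -EAGAIN or -ENOMEM */
	xb_preload_end();

	return ret;
}

static bool pfn_is_marked(unsigned long pfn)
{
	return xb_test_bit(&marked_pfns, pfn);
}

static void unmark_pfn(unsigned long pfn)
{
	xb_clear_bit(&marked_pfns, pfn);
}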

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add Tags to help searching

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) Clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly rather than
      search the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first.
3) more efficient bit searching, like 2.A.
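
Point 2.B above can be pictured with a short sketch (the function and the
range below are hypothetical): clearing a range that covers one or more whole
ida bitmaps is expected to free those bitmaps directly instead of clearing
them word by word.

#include <linux/xbitmap.h>

/* Sketch: 'xb' is an already populated xbitmap. */
static void drop_first_two_chunks(struct xb *xb)
{
	/*
	 * The range spans exactly two ida bitmaps, so both are expected
	 * to be freed as a whole rather than cleared bit by bit.
	 */
	xb_clear_bit_range(xb, 0, 2 * IDA_BITMAP_BITS - 1);
}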
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  66 ++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  42 +++++++-
 lib/xbitmap.c              | 264 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 373 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..1cffeb3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..f634bd9
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,66 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..1e15e30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits and is built
+ * on top of ida bitmaps. Given an unsigned long index, the high order
+ * XB_INDEX_BITS bits of the index are used to look up the corresponding
+ * item (i.e. the ida bitmap) in the radix tree, and the low order
+ * (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index select the bit within
+ * that ida bitmap.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,29 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }
 
 /**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		preempt_disable();
+
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..4ab9ac2
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,264 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @xb: the xbitmap tree used to record the bits
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * Clear a range of bits; an ida bitmap whose bits all become 0 is freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit > BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [Qemu-devel] [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
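
One natural use of the search helpers, sketched here under the assumption of
an external lock and a hypothetical caller, is finding and claiming the first
unused index in a range, in the spirit of an IDA-style allocator:

#include <linux/errno.h>
#include <linux/xbitmap.h>

/*
 * Sketch: return the first clear bit in [0, max] of 'xb' and mark it as
 * used, or a negative errno if the range is exhausted or the set fails.
 * Callers are assumed to serialise access to the xbitmap themselves.
 */
static long grab_free_index(struct xb *xb, unsigned long max)
{
	unsigned long idx;
	int err;

	idx = xb_find_next_zero_bit(xb, 0, max);
	if (idx > max)
		return -ENOSPC;

	xb_preload(GFP_KERNEL);
	err = xb_set_bit(xb, idx);
	xb_preload_end();

	return err ? err : (long)idx;
}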

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add Tags to help searching

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) Clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly rather than
      search the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first.
3) more efficient bit searching, like 2.A.
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  66 ++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  42 +++++++-
 lib/xbitmap.c              | 264 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 373 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..1cffeb3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..f634bd9
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,66 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..1e15e30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits and is built
+ * on top of ida bitmaps. Given an unsigned long index, the high order
+ * XB_INDEX_BITS bits of the index are used to look up the corresponding
+ * item (i.e. the ida bitmap) in the radix tree, and the low order
+ * (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index select the bit within
+ * that ida bitmap.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,29 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }
 
 /**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		preempt_disable();
+
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..4ab9ac2
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,264 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @xb: the xbitmap tree used to record the bits
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * Clear a range of bits; an ida bitmap whose bits all become 0 is freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit > BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap
  2017-09-30  4:05 ` Wei Wang
                   ` (3 preceding siblings ...)
  (?)
@ 2017-09-30  4:05 ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add Tags to help searching

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) Clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly rather than
      search the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first.
3) more efficient bit searching, like 2.A.
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  66 ++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  42 +++++++-
 lib/xbitmap.c              | 264 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 373 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..1cffeb3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..f634bd9
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,66 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..1e15e30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits and is built
+ * on top of ida bitmaps. Given an unsigned long index, the high order
+ * XB_INDEX_BITS bits of the index are used to look up the corresponding
+ * item (i.e. the ida bitmap) in the radix tree, and the low order
+ * (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index select the bit within
+ * that ida bitmap.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,29 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }
 
 /**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		preempt_disable();
+
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..4ab9ac2
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,264 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @xb: the xbitmap tree used to record the bits
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * Clear a range of bits; an ida bitmap whose bits all become 0 is freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit > BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
-- 
2.7.4

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [virtio-dev] [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster.  It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().

More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add Tags to help searching

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>

v15->v16 ChangeLog:
1) coding style - separate small functions for bit set/clear/test;
2) Clear a range of bits in a more efficient way:
   A) clear a range of bits from the same ida bitmap directly rather than
      search the bitmap again for each bit;
   B) when the range of bits to clear covers the whole ida bitmap,
      directly free the bitmap - no need to zero the bitmap first.
3) more efficient bit searching, like 2.A.
---
 include/linux/radix-tree.h |   2 +
 include/linux/xbitmap.h    |  66 ++++++++++++
 lib/Makefile               |   2 +-
 lib/radix-tree.c           |  42 +++++++-
 lib/xbitmap.c              | 264 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 373 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/xbitmap.h
 create mode 100644 lib/xbitmap.c

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3e57350..1cffeb3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -309,6 +309,8 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 		const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 			     void __rcu **slot, void *entry);
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot);
 void __radix_tree_delete_node(struct radix_tree_root *,
 			      struct radix_tree_node *,
 			      radix_tree_update_node_t update_node,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..f634bd9
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,66 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#ifndef __XBITMAP_H__
+#define __XBITMAP_H__
+
+#include <linux/idr.h>
+
+struct xb {
+	struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {							\
+	.xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),		\
+}
+#define DEFINE_XB(name)		struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+	INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
+
+/* Check if the xb tree is empty */
+static inline bool xb_is_empty(const struct xb *xb)
+{
+	return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t gfp);
+
+/**
+ * xb_preload_end - end preload section started with xb_preload()
+ *
+ * Each xb_preload() should be matched with an invocation of this
+ * function. See xb_preload() for details.
+ */
+static inline void xb_preload_end(void)
+{
+	preempt_enable();
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 40c1837..ea50496 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-	 idr.o int_sqrt.o extable.o \
+	 idr.o xbitmap.o int_sqrt.o extable.o \
 	 sha1.o chacha20.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 898e879..1e15e30 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,19 @@ static struct kmem_cache *radix_tree_node_cachep;
 #define IDA_PRELOAD_SIZE	(IDA_MAX_PATH * 2 - 1)
 
 /*
+ * The xbitmap implementation supports up to ULONG_MAX bits and is built
+ * on top of ida bitmaps. Given an unsigned long index, the high order
+ * XB_INDEX_BITS bits of the index are used to look up the corresponding
+ * item (i.e. the ida bitmap) in the radix tree, and the low order
+ * (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index select the bit within
+ * that ida bitmap.
+ */
+#define XB_INDEX_BITS		(BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH		(DIV_ROUND_UP(XB_INDEX_BITS, \
+					      RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE		(XB_MAX_PATH * 2 - 1)
+
+/*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
@@ -840,6 +853,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 							offset, 0, 0);
 			if (!child)
 				return -ENOMEM;
+			if (is_idr(root))
+				all_tag_set(child, IDR_FREE);
 			rcu_assign_pointer(*slot, node_to_entry(child));
 			if (node)
 				node->count++;
@@ -1986,8 +2001,8 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
 	delete_node(root, node, update_node, private);
 }
 
-static bool __radix_tree_delete(struct radix_tree_root *root,
-				struct radix_tree_node *node, void __rcu **slot)
+bool __radix_tree_delete(struct radix_tree_root *root,
+			 struct radix_tree_node *node, void __rcu **slot)
 {
 	void *old = rcu_dereference_raw(*slot);
 	int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
@@ -2005,6 +2020,29 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 }
 
 /**
+ *  xb_preload - preload for xb_set_bit()
+ *  @gfp: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). This function
+ * returns with preemption disabled. It will be enabled by xb_preload_end().
+ */
+void xb_preload(gfp_t gfp)
+{
+	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+		preempt_disable();
+
+	if (!this_cpu_read(ida_bitmap)) {
+		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+		if (!bitmap)
+			return;
+		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+		kfree(bitmap);
+	}
+}
+EXPORT_SYMBOL(xb_preload);
+
+/**
  * radix_tree_iter_delete - delete the entry at this iterator position
  * @root: radix tree root
  * @iter: iterator state
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..4ab9ac2
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,264 @@
+#include <linux/slab.h>
+#include <linux/xbitmap.h>
+
+/**
+ *  xb_set_bit - set a bit in the xbitmap
+ *  @xb: the xbitmap tree used to record the bit
+ *  @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, it will be allocated.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set. The caller
+ * may want to call the function again.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * Clear a bit in the xbitmap. If the ida bitmap that @bit resides in becomes
+ * empty as a result, it is freed.
+ */
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @xb: the xbitmap tree used to record the bits
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, inclusive
+ *
+ * Clear a range of bits in the xbitmap. Any ida bitmap that becomes empty as
+ * a result is freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+				bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * Test a bit in the xbitmap.
+ *
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit >= BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, inclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
-- 
2.7.4
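
For illustration, a minimal sketch of how a caller could walk every set bit in
an inclusive range with the API above, relying on the @end + 1 "no match"
return convention (walk_set_bits() is a made-up helper name, not something
this series adds):

	static void walk_set_bits(struct xb *xb, unsigned long start,
				  unsigned long end)
	{
		unsigned long bit = start;

		while ((bit = xb_find_next_set_bit(xb, bit, end)) <= end) {
			/* handle the set bit at index "bit" */
			bit++;
		}
	}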


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 2/5] radix tree test suite: add tests for xbitmap
  2017-09-30  4:05 ` Wei Wang
@ 2017-09-30  4:05   ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

Add the following tests for xbitmap:
1) single bit test: single bit set/clear/find;
2) bit range test: set/clear a range of bits and find a 0 or 1 bit in
the range.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 tools/include/linux/bitmap.h            |  34 ++++
 tools/include/linux/kernel.h            |   2 +
 tools/testing/radix-tree/Makefile       |   7 +-
 tools/testing/radix-tree/linux/kernel.h |   2 -
 tools/testing/radix-tree/main.c         |   5 +
 tools/testing/radix-tree/test.h         |   1 +
 tools/testing/radix-tree/xbitmap.c      | 269 ++++++++++++++++++++++++++++++++
 7 files changed, 317 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/radix-tree/xbitmap.c

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index e8b9f51..890dab2 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -36,6 +36,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
 	}
 }
 
+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+				  int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
+
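+/*
+ * bitmap_clear - clear a run of bits in a bitmap
+ * @map: the bitmap
+ * @start: first bit to clear
+ * @nbits: number of bits to clear
+ *
+ * Example: on a map that is entirely set, bitmap_clear(map, 60, 10) clears
+ * bits 60..69, crossing a word boundary on a 64-bit build.
+ */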
+static __always_inline void bitmap_clear(unsigned long *map,
+					  unsigned int start,
+					  unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 {
 	unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 77d2e94..21e90ee 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -12,6 +12,8 @@
 #define UINT_MAX	(~0U)
 #endif
 
+#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 
 #define PERF_ALIGN(x, a)	__PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 6a9480c..fc7cb422 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -5,7 +5,8 @@ LDLIBS+= -lpthread -lurcu
 TARGETS = main idr-test multiorder
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
-	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+	 xbitmap.o
 
 ifndef SHIFT
 	SHIFT=3
@@ -24,6 +25,9 @@ idr-test: idr-test.o $(CORE_OFILES)
 
 multiorder: multiorder.o $(CORE_OFILES)
 
+xbitmap: xbitmap.o $(CORE_OFILES)
+	$(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
+
 clean:
 	$(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h
 
@@ -33,6 +37,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
 	../../include/linux/*.h \
 	../../include/asm/*.h \
 	../../../include/linux/radix-tree.h \
+	../../../include/linux/xbitmap.h \
 	../../../include/linux/idr.h
 
 radix-tree.c: ../../../lib/radix-tree.c
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index b21a77f..c1e6088 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -16,6 +16,4 @@
 #define pr_debug printk
 #define pr_cont printk
 
-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
-
 #endif /* _KERNEL_H */
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index bc9a784..6f4774e 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -337,6 +337,11 @@ static void single_thread_tests(bool long_run)
 	rcu_barrier();
 	printv(2, "after copy_tag_check: %d allocated, preempt %d\n",
 		nr_allocated, preempt_count);
+
+	xbitmap_checks();
+	rcu_barrier();
+	printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+			nr_allocated, preempt_count);
 }
 
 int main(int argc, char **argv)
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index 0f8220c..f8dcdaa 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -36,6 +36,7 @@ void iteration_test(unsigned order, unsigned duration);
 void benchmark(void);
 void idr_checks(void);
 void ida_checks(void);
+void xbitmap_checks(void);
 void ida_thread_tests(void);
 
 struct item *
diff --git a/tools/testing/radix-tree/xbitmap.c b/tools/testing/radix-tree/xbitmap.c
new file mode 100644
index 0000000..2787cb2
--- /dev/null
+++ b/tools/testing/radix-tree/xbitmap.c
@@ -0,0 +1,269 @@
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include "../../../include/linux/xbitmap.h"
+
+static DEFINE_XB(xb1);
+
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit >= BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+			    bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+
+static void xbitmap_check_bit(unsigned long bit)
+{
+	xb_preload(GFP_KERNEL);
+
+	assert(!xb_test_bit(&xb1, bit));
+	assert(!xb_set_bit(&xb1, bit));
+	assert(xb_test_bit(&xb1, bit));
+	xb_clear_bit(&xb1, bit);
+	assert(xb_is_empty(&xb1));
+
+	xb_preload_end();
+}
+
+static void xbitmap_check_bit_range(void)
+{
+	xb_preload(GFP_KERNEL);
+
+	/* Set a range of bits */
+	assert(!xb_set_bit(&xb1, 1060));
+	assert(!xb_set_bit(&xb1, 1061));
+	assert(!xb_set_bit(&xb1, 1064));
+	assert(!xb_set_bit(&xb1, 1065));
+	assert(!xb_set_bit(&xb1, 8180));
+	assert(!xb_set_bit(&xb1, 8181));
+	assert(!xb_set_bit(&xb1, 8190));
+	assert(!xb_set_bit(&xb1, 8191));
+
+	/* Test a range of bits */
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 1060);
+	assert(xb_find_next_zero_bit(&xb1, 1061, 10000) == 1062);
+	assert(xb_find_next_set_bit(&xb1, 1062, 10000) == 1064);
+	assert(xb_find_next_zero_bit(&xb1, 1065, 10000) == 1066);
+	assert(xb_find_next_set_bit(&xb1, 1066, 10000) == 8180);
+	assert(xb_find_next_zero_bit(&xb1, 8180, 10000) == 8182);
+	xb_clear_bit_range(&xb1, 0, 1000000);
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 10001);
+
+	assert(xb_find_next_zero_bit(&xb1, 20000, 30000) == 20000);
+
+	xb_preload_end();
+}
+
+void xbitmap_checks(void)
+{
+	xb_init(&xb1);
+
+	xbitmap_check_bit(0);
+	xbitmap_check_bit(30);
+	xbitmap_check_bit(31);
+	xbitmap_check_bit(1023);
+	xbitmap_check_bit(1024);
+	xbitmap_check_bit(1025);
+	xbitmap_check_bit((1UL << 63) | (1UL << 24));
+	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+	xbitmap_check_bit_range();
+}
-- 
2.7.4
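
With these hooks in place, the new checks run as part of the suite's default
main binary: xbitmap.o is added to OFILES and xbitmap_checks() is called from
single_thread_tests(), so (assuming the suite's usual build-and-run workflow)
a plain make followed by ./main in tools/testing/radix-tree exercises them.
The standalone xbitmap target is only built when requested explicitly, since
it is not added to TARGETS.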

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [virtio-dev] [PATCH v16 2/5] radix tree test suite: add tests for xbitmap
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

From: Matthew Wilcox <mawilcox@microsoft.com>

Add the following tests for xbitmap:
1) single bit test: single bit set/clear/find;
2) bit range test: set/clear a range of bits and find a 0 or 1 bit in
the range.
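
As a short illustration of the semantics the range test relies on (a
sketch built from the values asserted below, not part of the patch
itself): with bits 1060, 1061, 1064 and 1065 (among others) set, the
find helpers return the first matching bit at or after 'start', or
end + 1 when no such bit exists in [start, end]:

	xb_find_next_set_bit(&xb1, 0, 10000);     /* -> 1060, first set bit */
	xb_find_next_zero_bit(&xb1, 1061, 10000); /* -> 1062, 1061 is set */
	xb_clear_bit_range(&xb1, 0, 1000000);     /* clear everything */
	xb_find_next_set_bit(&xb1, 0, 10000);     /* -> 10001, i.e. end + 1 */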

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 tools/include/linux/bitmap.h            |  34 ++++
 tools/include/linux/kernel.h            |   2 +
 tools/testing/radix-tree/Makefile       |   7 +-
 tools/testing/radix-tree/linux/kernel.h |   2 -
 tools/testing/radix-tree/main.c         |   5 +
 tools/testing/radix-tree/test.h         |   1 +
 tools/testing/radix-tree/xbitmap.c      | 269 ++++++++++++++++++++++++++++++++
 7 files changed, 317 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/radix-tree/xbitmap.c

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index e8b9f51..890dab2 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -36,6 +36,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
 	}
 }
 
+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+				  int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
+
+static inline __always_inline void bitmap_clear(unsigned long *map,
+						unsigned int start,
+						unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset((char *)map + start / 8, 0, nbits / 8);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
 {
 	unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 77d2e94..21e90ee 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -12,6 +12,8 @@
 #define UINT_MAX	(~0U)
 #endif
 
+#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 
 #define PERF_ALIGN(x, a)	__PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 6a9480c..fc7cb422 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -5,7 +5,8 @@ LDLIBS+= -lpthread -lurcu
 TARGETS = main idr-test multiorder
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
-	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+	 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+	 xbitmap.o
 
 ifndef SHIFT
 	SHIFT=3
@@ -24,6 +25,9 @@ idr-test: idr-test.o $(CORE_OFILES)
 
 multiorder: multiorder.o $(CORE_OFILES)
 
+xbitmap: xbitmap.o $(CORE_OFILES)
+	$(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
+
 clean:
 	$(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h
 
@@ -33,6 +37,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
 	../../include/linux/*.h \
 	../../include/asm/*.h \
 	../../../include/linux/radix-tree.h \
+	../../../include/linux/xbitmap.h \
 	../../../include/linux/idr.h
 
 radix-tree.c: ../../../lib/radix-tree.c
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index b21a77f..c1e6088 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -16,6 +16,4 @@
 #define pr_debug printk
 #define pr_cont printk
 
-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
-
 #endif /* _KERNEL_H */
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index bc9a784..6f4774e 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -337,6 +337,11 @@ static void single_thread_tests(bool long_run)
 	rcu_barrier();
 	printv(2, "after copy_tag_check: %d allocated, preempt %d\n",
 		nr_allocated, preempt_count);
+
+	xbitmap_checks();
+	rcu_barrier();
+	printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+			nr_allocated, preempt_count);
 }
 
 int main(int argc, char **argv)
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index 0f8220c..f8dcdaa 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -36,6 +36,7 @@ void iteration_test(unsigned order, unsigned duration);
 void benchmark(void);
 void idr_checks(void);
 void ida_checks(void);
+void xbitmap_checks(void);
 void ida_thread_tests(void);
 
 struct item *
diff --git a/tools/testing/radix-tree/xbitmap.c b/tools/testing/radix-tree/xbitmap.c
new file mode 100644
index 0000000..2787cb2
--- /dev/null
+++ b/tools/testing/radix-tree/xbitmap.c
@@ -0,0 +1,269 @@
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include "../../../include/linux/xbitmap.h"
+
+static DEFINE_XB(xb1);
+
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+	int err;
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	err = __radix_tree_create(root, index, 0, &node, &slot);
+	if (err)
+		return err;
+	bitmap = rcu_dereference_raw(*slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit < BITS_PER_LONG) {
+			tmp |= 1UL << ebit;
+			rcu_assign_pointer(*slot, (void *)tmp);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+		rcu_assign_pointer(*slot, bitmap);
+	}
+
+	if (!bitmap) {
+		if (ebit < BITS_PER_LONG) {
+			bitmap = (void *)((1UL << ebit) |
+					RADIX_TREE_EXCEPTIONAL_ENTRY);
+			__radix_tree_replace(root, node, slot, bitmap, NULL,
+						NULL);
+			return 0;
+		}
+		bitmap = this_cpu_xchg(ida_bitmap, NULL);
+		if (!bitmap)
+			return -EAGAIN;
+		memset(bitmap, 0, sizeof(*bitmap));
+		__radix_tree_replace(root, node, slot, bitmap, NULL, NULL);
+	}
+
+	__set_bit(bit, bitmap->bitmap);
+	return 0;
+}
+
+bool xb_test_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	const struct radix_tree_root *root = &xb->xbrt;
+	struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+	bit %= IDA_BITMAP_BITS;
+
+	if (!bitmap)
+		return false;
+	if (radix_tree_exception(bitmap)) {
+		bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+		if (bit >= BITS_PER_LONG)
+			return false;
+		return (unsigned long)bitmap & (1UL << bit);
+	}
+
+	return test_bit(bit, bitmap->bitmap);
+}
+
+void xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+	unsigned long index = bit / IDA_BITMAP_BITS;
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned long ebit;
+
+	bit %= IDA_BITMAP_BITS;
+	ebit = bit + 2;
+
+	bitmap = __radix_tree_lookup(root, index, &node, &slot);
+	if (radix_tree_exception(bitmap)) {
+		unsigned long tmp = (unsigned long)bitmap;
+
+		if (ebit >= BITS_PER_LONG)
+			return;
+		tmp &= ~(1UL << ebit);
+		if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+			__radix_tree_delete(root, node, slot);
+		else
+			rcu_assign_pointer(*slot, (void *)tmp);
+		return;
+	}
+
+	if (!bitmap)
+		return;
+
+	__clear_bit(bit, bitmap->bitmap);
+	if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+		kfree(bitmap);
+		__radix_tree_delete(root, node, slot);
+	}
+}
+
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bitmap;
+	unsigned int nbits;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bitmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bitmap)) {
+			unsigned long ebit = bit + 2;
+			unsigned long tmp = (unsigned long)bitmap;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+
+			nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+			bitmap_clear(&tmp, ebit, nbits);
+			if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+				__radix_tree_delete(root, node, slot);
+			else
+				rcu_assign_pointer(*slot, (void *)tmp);
+		} else if (bitmap) {
+			nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+			if (nbits != IDA_BITMAP_BITS)
+				bitmap_clear(bitmap->bitmap, bit, nbits);
+
+			if (nbits == IDA_BITMAP_BITS ||
+			    bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+				kfree(bitmap);
+				__radix_tree_delete(root, node, slot);
+			}
+		}
+	}
+}
+
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+				      unsigned long end, bool set)
+{
+	struct radix_tree_root *root = &xb->xbrt;
+	struct radix_tree_node *node;
+	void **slot;
+	struct ida_bitmap *bmap;
+	unsigned long ret = end + 1;
+
+	for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+		unsigned long index = start / IDA_BITMAP_BITS;
+		unsigned long bit = start % IDA_BITMAP_BITS;
+
+		bmap = __radix_tree_lookup(root, index, &node, &slot);
+		if (radix_tree_exception(bmap)) {
+			unsigned long tmp = (unsigned long)bmap;
+			unsigned long ebit = bit + 2;
+
+			if (ebit >= BITS_PER_LONG)
+				continue;
+			if (set)
+				ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+			else
+				ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+							 ebit);
+			if (ret < BITS_PER_LONG)
+				return ret - 2 + IDA_BITMAP_BITS * index;
+		} else if (bmap) {
+			if (set)
+				ret = find_next_bit(bmap->bitmap,
+						    IDA_BITMAP_BITS, bit);
+			else
+				ret = find_next_zero_bit(bmap->bitmap,
+							 IDA_BITMAP_BITS, bit);
+			if (ret < IDA_BITMAP_BITS)
+				return ret + index * IDA_BITMAP_BITS;
+		} else if (!bmap && !set) {
+			return start;
+		}
+	}
+
+	return ret;
+}
+
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+				   unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 1);
+}
+
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+				    unsigned long end)
+{
+	return xb_find_next_bit(xb, start, end, 0);
+}
+
+static void xbitmap_check_bit(unsigned long bit)
+{
+	xb_preload(GFP_KERNEL);
+
+	assert(!xb_test_bit(&xb1, bit));
+	assert(!xb_set_bit(&xb1, bit));
+	assert(xb_test_bit(&xb1, bit));
+	xb_clear_bit(&xb1, bit);
+	assert(xb_is_empty(&xb1));
+
+	xb_preload_end();
+}
+
+static void xbitmap_check_bit_range(void)
+{
+	xb_preload(GFP_KERNEL);
+
+	/* Set a range of bits */
+	assert(!xb_set_bit(&xb1, 1060));
+	assert(!xb_set_bit(&xb1, 1061));
+	assert(!xb_set_bit(&xb1, 1064));
+	assert(!xb_set_bit(&xb1, 1065));
+	assert(!xb_set_bit(&xb1, 8180));
+	assert(!xb_set_bit(&xb1, 8181));
+	assert(!xb_set_bit(&xb1, 8190));
+	assert(!xb_set_bit(&xb1, 8191));
+
+	/* Test a range of bits */
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 1060);
+	assert(xb_find_next_zero_bit(&xb1, 1061, 10000) == 1062);
+	assert(xb_find_next_set_bit(&xb1, 1062, 10000) == 1064);
+	assert(xb_find_next_zero_bit(&xb1, 1065, 10000) == 1066);
+	assert(xb_find_next_set_bit(&xb1, 1066, 10000) == 8180);
+	assert(xb_find_next_zero_bit(&xb1, 8180, 10000) == 8182);
+	xb_clear_bit_range(&xb1, 0, 1000000);
+	assert(xb_find_next_set_bit(&xb1, 0, 10000) == 10001);
+
+	assert(xb_find_next_zero_bit(&xb1, 20000, 30000) == 20000);
+
+	xb_preload_end();
+}
+
+void xbitmap_checks(void)
+{
+	xb_init(&xb1);
+
+	xbitmap_check_bit(0);
+	xbitmap_check_bit(30);
+	xbitmap_check_bit(31);
+	xbitmap_check_bit(1023);
+	xbitmap_check_bit(1024);
+	xbitmap_check_bit(1025);
+	xbitmap_check_bit((1UL << 63) | (1UL << 24));
+	xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+	xbitmap_check_bit_range();
+}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-30  4:05 ` Wei Wang
  (?)
  (?)
@ 2017-09-30  4:05   ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the percentage breakdown of the time
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).
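In absolute terms that is roughly 0.683 * 4126 ≈ 2818ms and
0.19 * 4126 ≈ 784ms, respectively.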

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically contiguous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.
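
Concretely, each maximal run of contiguous set bits in the xbitmap
becomes one (address, length) pair. One iteration of the loop in
tell_host_sgs() below boils down to this (a simplified sketch; the
splitting of sgs larger than UINT_MAX and the error handling are left
out):

	/* first ballooned pfn at or after sg_pfn_start */
	sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
					    page_xb_end);
	/* the first clear bit after it ends the contiguous run */
	sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb, sg_pfn_start + 1,
					   page_xb_end);
	sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
	sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
	/* a single sg covers the whole run of pages */
	send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);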

With this new feature, the above ballooning process takes ~492ms,
resulting in an improvement of ~88%.
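(That is, (4126 - 492) / 4126 ≈ 0.88.)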

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 188 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 172 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..6952e19 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+	unsigned int len;
+
+	sg_init_one(&sg, addr, size);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &len))
+		;
+
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static int send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size,
+				 bool batch)
+{
+	int err;
+
+	err = add_one_sg(vq, addr, size);
+
+	/* If batching is requested, we batch till the vq is full */
+	if (!batch || !vq->num_free)
+		kick_and_wait(vq, vb->acked);
+
+	return err;
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for contiguous "1" bits, which correspond
+ * to contiguous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+	int err = 0;
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
+						    page_xb_end);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+						   sg_pfn_start + 1,
+						   page_xb_end);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
+						   true);
+			if (unlikely(err < 0))
+				goto err_out;
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
+		if (unlikely(err < 0))
+			goto err_out;
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+	return;
+
+err_out:
+	dev_warn(&vb->vdev->dev, "%s failure: %d\n", __func__, err);
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +282,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +296,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +342,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +357,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +581,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compaction thread.     (called under page lock)
@@ -464,6 +605,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +627,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +703,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +822,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-30  4:05 ` Wei Wang
                   ` (6 preceding siblings ...)
  (?)
@ 2017-09-30  4:05 ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer
of balloon (i.e. inflated/deflated) pages using scatter-gather lists
to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the balloon pages are transferred to the
host one by one. Here is the percentage breakdown of the time
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
sgs. An sg describes a chunk of guest physically contiguous pages.
With this mechanism, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~492ms,
resulting in an improvement of ~88%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 188 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |   1 +
 2 files changed, 172 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f0b3a0b..6952e19 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
 #include <linux/mm.h>
 #include <linux/mount.h>
 #include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/* The xbitmap used to record balloon pages */
+	struct xb page_xb;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+	unsigned int len;
+
+	virtqueue_kick(vq);
+	wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	struct scatterlist sg;
+	unsigned int len;
+
+	sg_init_one(&sg, addr, size);
+
+	/* Detach all the used buffers from the vq */
+	while (virtqueue_get_buf(vq, &len))
+		;
+
+	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+}
+
+static int send_balloon_page_sg(struct virtio_balloon *vb,
+				 struct virtqueue *vq,
+				 void *addr,
+				 uint32_t size,
+				 bool batch)
+{
+	int err;
+
+	err = add_one_sg(vq, addr, size);
+
+	/* If batching is requested, we batch till the vq is full */
+	if (!batch || !vq->num_free)
+		kick_and_wait(vq, vb->acked);
+
+	return err;
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for contiguous "1" bits, which correspond
+ * to contiguous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+			  struct virtqueue *vq,
+			  unsigned long page_xb_start,
+			  unsigned long page_xb_end)
+{
+	unsigned long sg_pfn_start, sg_pfn_end;
+	void *sg_addr;
+	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
+	int err = 0;
+
+	sg_pfn_start = page_xb_start;
+	while (sg_pfn_start < page_xb_end) {
+		sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
+						    page_xb_end);
+		if (sg_pfn_start == page_xb_end + 1)
+			break;
+		sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+						   sg_pfn_start + 1,
+						   page_xb_end);
+		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
+		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
+		while (sg_len > sg_max_len) {
+			err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
+						   true);
+			if (unlikely(err < 0))
+				goto err_out;
+			sg_addr += sg_max_len;
+			sg_len -= sg_max_len;
+		}
+		err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
+		if (unlikely(err < 0))
+			goto err_out;
+		sg_pfn_start = sg_pfn_end + 1;
+	}
+
+	/*
+	 * The last few sgs may not reach the batch size, but need a kick to
+	 * notify the device to handle them.
+	 */
+	if (vq->num_free != virtqueue_get_vring_size(vq))
+		kick_and_wait(vq, vb->acked);
+
+	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+	return;
+
+err_out:
+	dev_warn(&vb->vdev->dev, "%s failure: %d\n", __func__, err);
+}
+
+static inline void xb_set_page(struct virtio_balloon *vb,
+			       struct page *page,
+			       unsigned long *pfn_min,
+			       unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+	xb_preload(GFP_KERNEL);
+	xb_set_bit(&vb->page_xb, pfn);
+	xb_preload_end();
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +282,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +296,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!use_sg)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +342,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_sg)
+			xb_set_page(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +357,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns) {
+		if (use_sg)
+			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -441,6 +581,7 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
 *			     a compaction thread.     (called under page lock)
@@ -464,6 +605,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
 	unsigned long flags;
 
 	/*
@@ -485,16 +627,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
+				     PAGE_SIZE, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (use_sg) {
+		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
+				     PAGE_SIZE, false);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -553,6 +703,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+		xb_init(&vb->page_xb);
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -669,6 +822,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_SG,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
-- 
2.7.4
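
For illustration, the chunking that tell_host_sgs() performs on the page
xbitmap can be sketched in plain userspace C. This is only a simplified
model (a flat array stands in for the xbitmap, and a made-up emit_sg()
stands in for the virtqueue calls), not the driver code itself:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define NR_PFNS		64
/* same cap as the driver: round_down(UINT_MAX, PAGE_SIZE) */
#define SG_MAX_LEN	(UINT32_MAX & ~(PAGE_SIZE - 1))

static int ballooned[NR_PFNS];	/* 1 = pfn is ballooned ("set bit") */

static void emit_sg(unsigned long pfn, uint64_t len)
{
	printf("sg: pfn %lu, len %llu bytes\n", pfn, (unsigned long long)len);
}

int main(void)
{
	unsigned long start = 0, end;

	ballooned[3] = ballooned[4] = ballooned[5] = ballooned[10] = 1;

	while (start < NR_PFNS) {
		while (start < NR_PFNS && !ballooned[start])	/* next set bit */
			start++;
		if (start == NR_PFNS)
			break;
		end = start;
		while (end < NR_PFNS && ballooned[end])		/* next zero bit */
			end++;

		uint64_t len = (uint64_t)(end - start) << PAGE_SHIFT;
		/* split runs that exceed what one 32-bit sg length can hold */
		while (len > SG_MAX_LEN) {
			emit_sg(start, SG_MAX_LEN);
			start += SG_MAX_LEN >> PAGE_SHIFT;
			len -= SG_MAX_LEN;
		}
		emit_sg(start, len);
		start = end + 1;
	}
	return 0;
}

Running it prints two sgs: pfns 3-5 as one 12KB chunk and pfn 10 as a 4KB
chunk, which is the same grouping the driver would hand to
virtqueue_add_inbuf() via send_balloon_page_sg().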

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 4/5] mm: support reporting free page blocks
@ 2017-09-30  4:05   ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One example use of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.
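
As an illustration of the interface (a minimal sketch, not part of this
patch; the context struct and callback below are hypothetical names), a
caller could consume the reported ranges like this:

#include <linux/mm.h>

struct free_page_ctx {
	unsigned long reported_pages;
	unsigned long limit;
};

static bool record_free_range(void *opaque, unsigned long pfn,
			      unsigned long num)
{
	struct free_page_ctx *ctx = opaque;

	/*
	 * No sleeping or memory allocation is allowed here; just note the
	 * range. The pages may be reused at any time after this returns,
	 * so stale hints must be tolerated (e.g. by the hypervisor's dirty
	 * tracking, as described above).
	 */
	ctx->reported_pages += num;

	/* Returning false stops the walk early. */
	return ctx->reported_pages < ctx->limit;
}

static void report_free_pages(void)
{
	struct free_page_ctx ctx = { .limit = 1UL << 20 };

	/* min_order 0 reports every free block; may sleep, process context only */
	walk_free_mem_block(&ctx, 0, record_free_range);
}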

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..d9652c2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque,
+				int min_order,
+				bool (*report_pfn_range)(void *opaque,
+							 unsigned long pfn,
+							 unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..c6bb874 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+				struct zone *zone,
+				int order,
+				enum migratetype mt,
+				bool (*report_pfn_range)(void *,
+							 unsigned long,
+							 unsigned long))
+{
+	struct page *page;
+	struct list_head *list;
+	unsigned long pfn, flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list = &zone->free_area[order].free_list[mt];
+	list_for_each_entry(page, list, lru) {
+		pfn = page_to_pfn(page);
+		ret = report_pfn_range(opaque, pfn, 1 << order);
+		if (!ret)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations, like
+ * ballooning, etc. This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+			 int min_order,
+			 bool (*report_pfn_range)(void *opaque,
+						  unsigned long pfn,
+						  unsigned long num))
+{
+	struct zone *zone;
+	int order;
+	enum migratetype mt;
+	bool ret;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				ret = walk_free_page_list(opaque, zone,
+							  order, mt,
+							  report_pfn_range);
+				if (!ret)
+					return;
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 4/5] mm: support reporting free page blocks
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..d9652c2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern void walk_free_mem_block(void *opaque,
+				int min_order,
+				bool (*report_pfn_range)(void *opaque,
+							 unsigned long pfn,
+							 unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..c6bb874 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,6 +4762,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+				struct zone *zone,
+				int order,
+				enum migratetype mt,
+				bool (*report_pfn_range)(void *,
+							 unsigned long,
+							 unsigned long))
+{
+	struct page *page;
+	struct list_head *list;
+	unsigned long pfn, flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list = &zone->free_area[order].free_list[mt];
+	list_for_each_entry(page, list, lru) {
+		pfn = page_to_pfn(page);
+		ret = report_pfn_range(opaque, pfn, 1 << order);
+		if (!ret)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+			 int min_order,
+			 bool (*report_pfn_range)(void *opaque,
+						  unsigned long pfn,
+						  unsigned long num))
+{
+	struct zone *zone;
+	int order;
+	enum migratetype mt;
+	bool ret;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				ret = walk_free_page_list(opaque, zone,
+							  order, mt,
+							  report_pfn_range);
+				if (!ret)
+					return;
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-09-30  4:05   ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq, ctrl_vq, to handle commands between the host and guest.
With this feature, we will be able to have the control plane and data
plane separated. In other words, the control-related commands of each
feature will be sent via the ctrl_vq, while each feature may have its
own vq used as its data plane.

Free page report is the first new feature controlled via ctrl_vq,
and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
Currently, this feature has two cmds:
VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
to start the free page report work.
VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
would send the cmd to the host to indicate the reporting work is done.
The host would send the cmd to the guest to actively request the stop
of the reporting work.

The free_page_vq is used to transmit the guest free page blocks to the
host.
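
As an illustration of the command format (a minimal sketch, not part of
this patch; the helper name below is hypothetical, the driver itself uses
ctrlq_send_cmd()), each ctrl_vq command is just the {class, cmd} pair
added to the uapi header by this patch:

#include <linux/virtio_config.h>
#include <linux/virtio_balloon.h>

static void fill_free_page_cmd(struct virtio_device *vdev,
			       struct virtio_balloon_ctrlq_cmd *c, u32 cmd)
{
	c->class = cpu_to_virtio32(vdev, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE);
	c->cmd = cpu_to_virtio32(vdev, cmd);
}

/*
 * Expected exchange for the free page report feature:
 *   host  -> guest: {FREE_PAGE, VIRTIO_BALLOON_FREE_PAGE_F_START} on ctrl_vq
 *   guest: walks free memory and posts the blocks as sgs on free_page_vq
 *   guest -> host: {FREE_PAGE, VIRTIO_BALLOON_FREE_PAGE_F_STOP} when done
 *   host  -> guest: ..._F_STOP may also be sent to abort the reporting early
 */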

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 249 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  15 +++
 2 files changed, 244 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6952e19..70dc4ae 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
+			 *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The work items submitted to the balloon wq are listed here */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +71,9 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -93,6 +102,11 @@ struct virtio_balloon {
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
+
+	/* Host to guest ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
+	/* Guest to Host ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
 	return err;
 }
 
+static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	int ret = 0;
+
+	/*
+	 * Since this is an optimization feature, losing a couple of free
+	 * pages to report isn't important. We simply return without adding
+	 * the page if the vq is full.
+	 */
+	if (vq->num_free) {
+		ret = add_one_sg(vq, addr, size);
+		if (!ret)
+			virtqueue_kick(vq);
+	}
+
+	return ret;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return false;
+
+	/* If the vq is broken, stop reporting the free pages. */
+	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
+		return false;
+
+	return true;
+}
+
+static void ctrlq_add_cmd(struct virtqueue *vq,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
+	if (inbuf)
+		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+	else
+		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+}
+
+static void ctrlq_send_cmd(struct virtio_balloon *vb,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
+{
+	struct virtqueue *vq = vb->ctrl_vq;
+
+	ctrlq_add_cmd(vq, cmd, inbuf);
+	if (!inbuf) {
+		/*
+		 * All the input cmd buffers are replenished here.
+		 * This is necessary because the input cmd buffers are lost
+		 * after live migration. The device needs to rewind all of
+		 * them from the ctrl_vq.
+		 */
+		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
+	}
+	virtqueue_kick(vq);
+}
 
+static void report_free_page_end(struct virtio_balloon *vb)
+{
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+
+	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
+	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
+	vb->report_free_page_stop = true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_end(vb);
+}
+
+static void ctrlq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_ctrlq_cmd *msg;
+	unsigned int class, cmd, len;
+
+	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
+	if (unlikely(!msg))
+		return;
+
+	/* The outbuf is sent by the host for recycling, so just return. */
+	if (msg == &vb->free_page_cmd_out)
+		return;
+
+	class = virtio32_to_cpu(vb->vdev, msg->class);
+	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
+
+	switch (class) {
+	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
+		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
+			vb->report_free_page_stop = true;
+		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
+			vb->report_free_page_stop = false;
+			queue_work(vb->balloon_wq, &vb->report_free_page_work);
+		}
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
+			 __func__);
+	}
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	/* If ctrlq is enabled, the free page vq will also be created */
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
+		nvqs += 2;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		callbacks[i] = ctrlq_handle;
+		names[i++] = "ctrlq";
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->ctrl_vq = vqs[i++];
+		vb->free_page_vq = vqs[i];
+		/* Prime the ctrlq with an inbuf for the host to send a cmd */
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -823,6 +1031,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CTRL_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..dbf0616 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,18 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+enum {
+	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
+	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
+};
+
+struct virtio_balloon_ctrlq_cmd {
+	__virtio32 class;
+	__virtio32 cmd;
+};
+
+/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
+#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
+#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq, ctrl_vq, to handle commands between the host and guest.
With this feature, we will be able to have the control plane and data
plane separated. In other words, the control related commands of each
feature will be sent via the ctrl_vq, while each feature may have
its own vq used as a data plane.

Free page report is the first new feature controlled via ctrl_vq,
and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
Currently, this feature has two cmds:
VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
to start the free page report work.
VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
would send the cmd to the host to indicate the reporting work is done.
The host would send the cmd to the guest to actively request the stop
of the reporting work.

The free_page_vq is used to transmit the guest free page blocks to the
host.
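
The intended exchange, summarized informally (the host-side behaviour is
only implied by this patch, not implemented by it):

  host                                  guest (this driver)
  ----                                  -------------------
  FREE_PAGE_F_START on ctrl_vq  ----->  ctrlq_handle(): clear
                                        report_free_page_stop and queue
                                        report_free_page_work
                                        report_free_page(): walk free
                                        memory and add each block to
  consume hints from            <-----  free_page_vq via
  free_page_vq                          send_free_page_sg()
  (optional) FREE_PAGE_F_STOP   ----->  ctrlq_handle(): set
  to abort early                        report_free_page_stop, so the
                                        walk callback returns false
                                        report_free_page_end(): send
  FREE_PAGE_F_STOP on ctrl_vq   <-----  FREE_PAGE_F_STOP unless the host
                                        already requested a stop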

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 249 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  15 +++
 2 files changed, 244 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6952e19..70dc4ae 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
+			 *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The work items submitted to the balloon wq are listed here */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +71,9 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -93,6 +102,11 @@ struct virtio_balloon {
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
+
+	/* Host to guest ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
+	/* Guest to Host ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
 	return err;
 }
 
+static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	int ret = 0;
+
+	/*
+	 * Since this is an optimization feature, losing a couple of free
+	 * pages to report isn't important. We simply return without adding
+	 * the page if the vq is full.
+	 */
+	if (vq->num_free) {
+		ret = add_one_sg(vq, addr, size);
+		if (!ret)
+			virtqueue_kick(vq);
+	}
+
+	return ret;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return false;
+
+	/* If the vq is broken, stop reporting the free pages. */
+	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
+		return false;
+
+	return true;
+}
+
+static void ctrlq_add_cmd(struct virtqueue *vq,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
+	if (inbuf)
+		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+	else
+		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+}
+
+static void ctrlq_send_cmd(struct virtio_balloon *vb,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
+{
+	struct virtqueue *vq = vb->ctrl_vq;
+
+	ctrlq_add_cmd(vq, cmd, inbuf);
+	if (!inbuf) {
+		/*
+		 * All the input cmd buffers are replenished here.
+		 * This is necessary because the input cmd buffers are lost
+		 * after live migration. The device needs to rewind all of
+		 * them from the ctrl_vq.
+		 */
+		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
+	}
+	virtqueue_kick(vq);
+}
 
+static void report_free_page_end(struct virtio_balloon *vb)
+{
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+
+	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
+	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
+	vb->report_free_page_stop = true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_end(vb);
+}
+
+static void ctrlq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_ctrlq_cmd *msg;
+	unsigned int class, cmd, len;
+
+	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
+	if (unlikely(!msg))
+		return;
+
+	/* The outbuf is sent by the host for recycling, so just return. */
+	if (msg == &vb->free_page_cmd_out)
+		return;
+
+	class = virtio32_to_cpu(vb->vdev, msg->class);
+	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
+
+	switch (class) {
+	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
+		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
+			vb->report_free_page_stop = true;
+		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
+			vb->report_free_page_stop = false;
+			queue_work(vb->balloon_wq, &vb->report_free_page_work);
+		}
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
+			 __func__);
+	}
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	/* If ctrlq is enabled, the free page vq will also be created */
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
+		nvqs += 2;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		callbacks[i] = ctrlq_handle;
+		names[i++] = "ctrlq";
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->ctrl_vq = vqs[i++];
+		vb->free_page_vq = vqs[i];
+		/* Prime the ctrlq with an inbuf for the host to send a cmd */
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -823,6 +1031,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CTRL_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..dbf0616 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,18 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+enum {
+	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
+	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
+};
+
+struct virtio_balloon_ctrlq_cmd {
+	__virtio32 class;
+	__virtio32 cmd;
+};
+
+/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
+#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
+#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [Qemu-devel] [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq, ctrl_vq, to handle commands between the host and guest.
With this feature, we will be able to have the control plane and data
plane separated. In other words, the control related commands of each
feature will be sent via the ctrl_vq, while each feature may have
its own vq used as a data plane.

Free page report is the first new feature controlled via ctrl_vq,
and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
Currently, this feature has two cmds:
VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
to start the free page report work.
VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
would send the cmd to the host to indicate the reporting work is done.
The host would send the cmd to the guest to actively request the stop
of the reporting work.

The free_page_vq is used to transmit the guest free page blocks to the
host.
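
A side note on the cmd layout: the fields are declared __virtio32, so an
outgoing cmd would normally be filled through the virtio endian helpers,
mirroring the virtio32_to_cpu() calls on the receive path. A minimal,
illustrative sketch (fill_ctrlq_cmd() is a made-up helper, not part of
this patch):

	#include <linux/virtio_config.h>	/* cpu_to_virtio32() */

	/* Fill an outgoing ctrlq cmd in the device's byte order. */
	static void fill_ctrlq_cmd(struct virtio_device *vdev,
				   struct virtio_balloon_ctrlq_cmd *cmd,
				   u32 class, u32 command)
	{
		cmd->class = cpu_to_virtio32(vdev, class);
		cmd->cmd = cpu_to_virtio32(vdev, command);
	}

	/* e.g. before sending the guest-to-host STOP notification: */
	fill_ctrlq_cmd(vb->vdev, &vb->free_page_cmd_out,
		       VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE,
		       VIRTIO_BALLOON_FREE_PAGE_F_STOP);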

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 249 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  15 +++
 2 files changed, 244 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6952e19..70dc4ae 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
+			 *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The work items submitted to the balloon wq are listed here */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +71,9 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -93,6 +102,11 @@ struct virtio_balloon {
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
+
+	/* Host to guest ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
+	/* Guest to Host ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
 	return err;
 }
 
+static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	int ret = 0;
+
+	/*
+	 * Since this is an optimization feature, losing a couple of free
+	 * pages to report isn't important. We simply return without adding
+	 * the page if the vq is full.
+	 */
+	if (vq->num_free) {
+		ret = add_one_sg(vq, addr, size);
+		if (!ret)
+			virtqueue_kick(vq);
+	}
+
+	return ret;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return false;
+
+	/* If the vq is broken, stop reporting the free pages. */
+	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
+		return false;
+
+	return true;
+}
+
+static void ctrlq_add_cmd(struct virtqueue *vq,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
+	if (inbuf)
+		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+	else
+		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+}
+
+static void ctrlq_send_cmd(struct virtio_balloon *vb,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
+{
+	struct virtqueue *vq = vb->ctrl_vq;
+
+	ctrlq_add_cmd(vq, cmd, inbuf);
+	if (!inbuf) {
+		/*
+		 * All the input cmd buffers are replenished here.
+		 * This is necessary because the input cmd buffers are lost
+		 * after live migration. The device needs to rewind all of
+		 * them from the ctrl_vq.
+		 */
+		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
+	}
+	virtqueue_kick(vq);
+}
 
+static void report_free_page_end(struct virtio_balloon *vb)
+{
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+
+	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
+	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
+	vb->report_free_page_stop = true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_end(vb);
+}
+
+static void ctrlq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_ctrlq_cmd *msg;
+	unsigned int class, cmd, len;
+
+	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
+	if (unlikely(!msg))
+		return;
+
+	/* The outbuf is sent by the host for recycling, so just return. */
+	if (msg == &vb->free_page_cmd_out)
+		return;
+
+	class = virtio32_to_cpu(vb->vdev, msg->class);
+	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
+
+	switch (class) {
+	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
+		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
+			vb->report_free_page_stop = true;
+		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
+			vb->report_free_page_stop = false;
+			queue_work(vb->balloon_wq, &vb->report_free_page_work);
+		}
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
+			 __func__);
+	}
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	/* If ctrlq is enabled, the free page vq will also be created */
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
+		nvqs += 2;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		callbacks[i] = ctrlq_handle;
+		names[i++] = "ctrlq";
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->ctrl_vq = vqs[i++];
+		vb->free_page_vq = vqs[i];
+		/* Prime the ctrlq with an inbuf for the host to send a cmd */
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -823,6 +1031,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CTRL_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..dbf0616 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,18 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+enum {
+	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
+	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
+};
+
+struct virtio_balloon_ctrlq_cmd {
+	__virtio32 class;
+	__virtio32 cmd;
+};
+
+/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
+#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
+#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-09-30  4:05 ` Wei Wang
                   ` (11 preceding siblings ...)
  (?)
@ 2017-09-30  4:05 ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: aarcange, yang.zhang.wz, liliang.opensource, willy, amit.shah,
	quan.xu, cornelia.huck, pbonzini, mgorman

Add a new vq, ctrl_vq, to handle commands between the host and guest.
With this feature, we will be able to have the control plane and data
plane separated. In other words, the control related commands of each
feature will be sent via the ctrl_vq, while each feature may have
its own vq used as a data plane.

Free page report is the first new feature controlled via ctrl_vq,
and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
Currently, this feature has two cmds:
VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
to start the free page report work.
VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
would send the cmd to the host to indicate the reporting work is done.
The host would send the cmd to the guest to actively request the stop
of the reporting work.

The free_page_vq is used to transmit the guest free page blocks to the
host.
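
For context, walk_free_mem_block() (added earlier in this series) calls
the supplied callback once per free page block of at least the given
order, and ends the walk as soon as the callback returns false - which
is how the STOP cmd and a broken vq interrupt the report above. A
minimal illustration of that contract (the counting callback below is
made up for illustration only):

	/* Illustrative only: count free pages instead of reporting them. */
	static bool count_free_blocks(void *opaque, unsigned long pfn,
				      unsigned long nr_pages)
	{
		unsigned long *total = opaque;

		*total += nr_pages;
		return true;	/* keep walking; false stops the walk */
	}

	static unsigned long count_free_pages(void)
	{
		unsigned long total = 0;

		walk_free_mem_block(&total, 0, &count_free_blocks);
		return total;
	}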

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 249 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  15 +++
 2 files changed, 244 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6952e19..70dc4ae 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
+			 *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The work items submitted to the balloon wq are listed here */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +71,9 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -93,6 +102,11 @@ struct virtio_balloon {
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
+
+	/* Host to guest ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
+	/* Guest to Host ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
 	return err;
 }
 
+static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	int ret = 0;
+
+	/*
+	 * Since this is an optimization feature, losing a couple of free
+	 * pages to report isn't important. We simply return without adding
+	 * the page if the vq is full.
+	 */
+	if (vq->num_free) {
+		ret = add_one_sg(vq, addr, size);
+		if (!ret)
+			virtqueue_kick(vq);
+	}
+
+	return ret;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return false;
+
+	/* If the vq is broken, stop reporting the free pages. */
+	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
+		return false;
+
+	return true;
+}
+
+static void ctrlq_add_cmd(struct virtqueue *vq,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
+	if (inbuf)
+		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+	else
+		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+}
+
+static void ctrlq_send_cmd(struct virtio_balloon *vb,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
+{
+	struct virtqueue *vq = vb->ctrl_vq;
+
+	ctrlq_add_cmd(vq, cmd, inbuf);
+	if (!inbuf) {
+		/*
+		 * All the input cmd buffers are replenished here.
+		 * This is necessary because the input cmd buffers are lost
+		 * after live migration. The device needs to rewind all of
+		 * them from the ctrl_vq.
+		 */
+		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
+	}
+	virtqueue_kick(vq);
+}
 
+static void report_free_page_end(struct virtio_balloon *vb)
+{
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+
+	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
+	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
+	vb->report_free_page_stop = true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_end(vb);
+}
+
+static void ctrlq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_ctrlq_cmd *msg;
+	unsigned int class, cmd, len;
+
+	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
+	if (unlikely(!msg))
+		return;
+
+	/* The outbuf is sent by the host for recycling, so just return. */
+	if (msg == &vb->free_page_cmd_out)
+		return;
+
+	class = virtio32_to_cpu(vb->vdev, msg->class);
+	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
+
+	switch (class) {
+	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
+		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
+			vb->report_free_page_stop = true;
+		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
+			vb->report_free_page_stop = false;
+			queue_work(vb->balloon_wq, &vb->report_free_page_work);
+		}
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
+			 __func__);
+	}
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	/* If ctrlq is enabled, the free page vq will also be created */
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
+		nvqs += 2;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		callbacks[i] = ctrlq_handle;
+		names[i++] = "ctrlq";
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->ctrl_vq = vqs[i++];
+		vb->free_page_vq = vqs[i];
+		/* Prime the ctrlq with an inbuf for the host to send a cmd */
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -823,6 +1031,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CTRL_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..dbf0616 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,18 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+enum {
+	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
+	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
+};
+
+struct virtio_balloon_ctrlq_cmd {
+	__virtio32 class;
+	__virtio32 cmd;
+};
+
+/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
+#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
+#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [virtio-dev] [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-09-30  4:05   ` Wei Wang
  0 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-09-30  4:05 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, wei.w.wang, liliang.opensource, yang.zhang.wz, quan.xu

Add a new vq, ctrl_vq, to handle commands between the host and guest.
With this feature, we will be able to have the control plane and data
plane separated. In other words, the control related commands of each
feature will be sent via the ctrl_vq, while each feature may have
its own vq used as a data plane.

Free page report is the first new feature controlled via ctrl_vq,
and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
Currently, this feature has two cmds:
VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
to start the free page report work.
VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
would send the cmd to the host to indicate the reporting work is done.
The host would send the cmd to the guest to actively request the stop
of the reporting work.

The free_page_vq is used to transmit the guest free page blocks to the
host.
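
To make the units concrete, each reported block reaches the virtqueue as
a (kernel virtual address, byte length) pair derived from the block's
starting pfn and page count, as done in virtio_balloon_send_free_pages().
A small worked example (the numbers are illustrative):

	/*
	 * A free block starting at pfn 0x100000 with nr_pages = 512 and
	 * 4KB pages (PAGE_SHIFT = 12):
	 *
	 *	addr = pfn_to_kaddr(0x100000);	start of the block in the
	 *					kernel's direct mapping
	 *	len  = 512 << 12;		= 0x200000 bytes (2MB)
	 *
	 * send_free_page_sg() then hands (addr, len) to the free_page_vq
	 * as a single sg entry, so the whole 2MB block costs one vring
	 * descriptor rather than 512 individual pfn entries.
	 */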

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 drivers/virtio/virtio_balloon.c     | 249 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  15 +++
 2 files changed, 244 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6952e19..70dc4ae 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
+			 *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The work items submitted to the balloon wq are listed here */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -65,6 +71,9 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	/* Stop reporting free pages */
+	bool report_free_page_stop;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -93,6 +102,11 @@ struct virtio_balloon {
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
+
+	/* Host to guest ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
+	/* Guest to Host ctrlq cmd buf for free page report */
+	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
 	return err;
 }
 
+static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
+{
+	int ret = 0;
+
+	/*
+	 * Since this is an optimization feature, losing a couple of free
+	 * pages to report isn't important. We simply return without adding
+	 * the page if the vq is full.
+	 */
+	if (vq->num_free) {
+		ret = add_one_sg(vq, addr, size);
+		if (!ret)
+			virtqueue_kick(vq);
+	}
+
+	return ret;
+}
+
 /*
  * Send balloon pages in sgs to host. The balloon pages are recorded in the
  * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
-static int init_vqs(struct virtio_balloon *vb)
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+					   unsigned long nr_pages)
+{
+	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+	void *addr = (void *)pfn_to_kaddr(pfn);
+	uint32_t len = nr_pages << PAGE_SHIFT;
+
+	if (vb->report_free_page_stop)
+		return false;
+
+	/* If the vq is broken, stop reporting the free pages. */
+	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
+		return false;
+
+	return true;
+}
+
+static void ctrlq_add_cmd(struct virtqueue *vq,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct scatterlist sg;
+	int err;
+
+	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
+	if (inbuf)
+		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+	else
+		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
+
+	/* Sanity check: this can't really happen */
+	WARN_ON(err);
+}
+
+static void ctrlq_send_cmd(struct virtio_balloon *vb,
+			  struct virtio_balloon_ctrlq_cmd *cmd,
+			  bool inbuf)
+{
+	struct virtqueue *vq = vb->ctrl_vq;
+
+	ctrlq_add_cmd(vq, cmd, inbuf);
+	if (!inbuf) {
+		/*
+		 * All the input cmd buffers are replenished here.
+		 * This is necessary because the input cmd buffers are lost
+		 * after live migration. The device needs to rewind all of
+		 * them from the ctrl_vq.
+		 */
+		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
+	}
+	virtqueue_kick(vq);
+}
 
+static void report_free_page_end(struct virtio_balloon *vb)
+{
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The host may have already requested to stop the reporting before we
+	 * finish, so no need to notify the host in this case.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+	if (vb->report_free_page_stop)
+		return;
+
+	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
+	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
+	vb->report_free_page_stop = true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+	report_free_page_end(vb);
+}
+
+static void ctrlq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_ctrlq_cmd *msg;
+	unsigned int class, cmd, len;
+
+	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
+	if (unlikely(!msg))
+		return;
+
+	/* The outbuf is sent by the host for recycling, so just return. */
+	if (msg == &vb->free_page_cmd_out)
+		return;
+
+	class = virtio32_to_cpu(vb->vdev, msg->class);
+	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
+
+	switch (class) {
+	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
+		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
+			vb->report_free_page_stop = true;
+		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
+			vb->report_free_page_stop = false;
+			queue_work(vb->balloon_wq, &vb->report_free_page_work);
+		}
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
+			 __func__);
+	}
+}
+
+static int init_vqs(struct virtio_balloon *vb)
+{
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct scatterlist sg;
+	int i, nvqs, err = -ENOMEM;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	/* If ctrlq is enabled, the free page vq will also be created */
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
+		nvqs += 2;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		callbacks[i] = ctrlq_handle;
+		names[i++] = "ctrlq";
+		callbacks[i] = NULL;
+		names[i] = "free_page_vq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+					 NULL, NULL);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
-
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		    < 0) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			goto err_find;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->ctrl_vq = vqs[i++];
+		vb->free_page_vq = vqs[i];
+		/* Prime the ctrlq with an inbuf for the host to send a cmd */
+		vb->free_page_cmd_in.class =
+					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
+		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
 		xb_init(&vb->page_xb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		INIT_WORK(&vb->report_free_page_work, report_free_page);
+		vb->report_free_page_stop = true;
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
@@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->report_free_page_work);
 
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -823,6 +1031,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_SG,
+	VIRTIO_BALLOON_F_CTRL_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..dbf0616 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,18 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+enum {
+	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
+	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
+};
+
+struct virtio_balloon_ctrlq_cmd {
+	__virtio32 class;
+	__virtio32 cmd;
+};
+
+/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
+#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
+#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-09-30  4:05   ` Wei Wang
  (?)
  (?)
@ 2017-10-01  3:18     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-01  3:18 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> Add a new vq, ctrl_vq, to handle commands between the host and guest.
> With this feature, we will be able to have the control plane and data
> plane separated. In other words, the control related commands of each
> feature will be sent via the ctrl_vq, meanwhile each feature may have
> its own vq used as a data plane.
> 
> Free page report is the the first new feature controlled via ctrl_vq,
> and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
> Currently, this feature has two cmds:
> VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
> to start the free page report work.
> VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
> would send the cmd to the host to indicate the reporting work is done.
> The host would send the cmd to the guest to actively request the stop
> of the reporting work.
> 
> The free_page_vq is used to transmit the guest free page blocks to the
> host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  drivers/virtio/virtio_balloon.c     | 249 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |  15 +++
>  2 files changed, 244 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 6952e19..70dc4ae 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
> +			 *free_page_vq;
> +
> +	/* Balloon's own wq for cpu-intensive work items */
> +	struct workqueue_struct *balloon_wq;
> +	/* The work items submitted to the balloon wq are listed here */
> +	struct work_struct report_free_page_work;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -65,6 +71,9 @@ struct virtio_balloon {
>  	spinlock_t stop_update_lock;
>  	bool stop_update;
>  
> +	/* Stop reporting free pages */
> +	bool report_free_page_stop;
> +
>  	/* Waiting for host to ack the pages we released. */
>  	wait_queue_head_t acked;
>  
> @@ -93,6 +102,11 @@ struct virtio_balloon {
>  
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
> +
> +	/* Host to guest ctrlq cmd buf for free page report */
> +	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
> +	/* Guest to Host ctrlq cmd buf for free page report */
> +	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
>  };
>  
>  static struct virtio_device_id id_table[] = {
> @@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
>  	return err;
>  }
>  
> +static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	int ret = 0;
> +
> +	/*
> +	 * Since this is an optimization feature, losing a couplle of free

typo

> +	 * pages to report isn't important. We simply resturn without adding
> +	 * the page if the vq is full.
> +	 */
> +	if (vq->num_free) {
> +		ret = add_one_sg(vq, addr, size);
> +		if (!ret)
> +			virtqueue_kick(vq);
> +	}
> +
> +	return ret;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	if (vb->report_free_page_stop)
> +		return false;
> +
> +	/* If the vq is broken, stop reporting the free pages. */
> +	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
> +		return false;
> +
> +	return true;
> +}
> +
> +static void ctrlq_add_cmd(struct virtqueue *vq,
> +			  struct virtio_balloon_ctrlq_cmd *cmd,
> +			  bool inbuf)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct scatterlist sg;
> +	int err;
> +
> +	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
> +	if (inbuf)
> +		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
> +	else
> +		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
> +
> +	/* Sanity check: this can't really happen */
> +	WARN_ON(err);
> +}
> +
> +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> +			  struct virtio_balloon_ctrlq_cmd *cmd,
> +			  bool inbuf)
> +{
> +	struct virtqueue *vq = vb->ctrl_vq;
> +
> +	ctrlq_add_cmd(vq, cmd, inbuf);
> +	if (!inbuf) {
> +		/*
> +		 * All the input cmd buffers are replenished here.
> +		 * This is necessary because the input cmd buffers are lost
> +		 * after live migration. The device needs to rewind all of
> +		 * them from the ctrl_vq.

Confused. Live migration somehow loses state? Why is that and why
is it a good idea? And how do you know this is migration even?
Looks like all you know is you got free page end. Could be any
reason for this.


> +		 */
> +		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
> +	}
> +	virtqueue_kick(vq);
> +}
>  
> +static void report_free_page_end(struct virtio_balloon *vb)
> +{
>  	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * The host may have already requested to stop the reporting before we
> +	 * finish, so no need to notify the host in this case.
>  	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (vb->report_free_page_stop)
> +		return;
> +
> +	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
> +	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
> +	vb->report_free_page_stop = true;
> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_end(vb);
> +}
> +
> +static void ctrlq_handle(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +	struct virtio_balloon_ctrlq_cmd *msg;
> +	unsigned int class, cmd, len;
> +
> +	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> +	if (unlikely(!msg))
> +		return;
> +
> +	/* The outbuf is sent by the host for recycling, so just return. */
> +	if (msg == &vb->free_page_cmd_out)
> +		return;
> +
> +	class = virtio32_to_cpu(vb->vdev, msg->class);
> +	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> +
> +	switch (class) {
> +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> +		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> +			vb->report_free_page_stop = true;
> +		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> +			vb->report_free_page_stop = false;
> +			queue_work(vb->balloon_wq, &vb->report_free_page_work);
> +		}
> +		vb->free_page_cmd_in.class =
> +					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> +	break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> +			 __func__);
> +	}

Manipulating report_free_page_stop without any locks looks
very suspicious.
Also, what if we get two start commands? We should restart from the
beginning, should we not?
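
E.g. something along these lines - completely untested, and the
free_page_gen counter (an atomic_t added to struct virtio_balloon) is my
own invention, not something in this patch - just to make the intended
state machine explicit:

/*
 * Hypothetical helper, called from ctrlq_handle() for the FREE_PAGE
 * class; neither this function nor vb->free_page_gen exists in the
 * patch as posted.
 */
static void free_page_cmd_handle(struct virtio_balloon *vb, u32 cmd)
{
	if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
		/* any walk in flight sees the new generation and bails out */
		atomic_inc(&vb->free_page_gen);
	} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
		/*
		 * A second START also bumps the generation, so a walk that
		 * is still running stops, and the re-queued work item
		 * starts over from the beginning of the free list.
		 */
		atomic_inc(&vb->free_page_gen);
		queue_work(vb->balloon_wq, &vb->report_free_page_work);
	}
}

The report worker would snapshot atomic_read(&vb->free_page_gen) when it
starts, and the send callback would bail out as soon as the value
changes; that would also get rid of the unlocked writes to
report_free_page_stop.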

> +}
> +
> +static int init_vqs(struct virtio_balloon *vb)
> +{
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	/* If ctrlq is enabled, the free page vq will also be created */
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
> +		nvqs += 2;

Since you made this generic, free page reporting should have its own
feature flag rather than relying on the ctrl vq one.
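
E.g. (the bit value and the exact split are made up here, not something
this patch defines):

/* hypothetical: feature bit not reserved anywhere, shown only to
 * illustrate separating the two features */
#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	5 /* VQ for reporting free pages */

	/* in init_vqs(): each feature contributes its own vq */
	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
		nvqs++;
	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
		nvqs++;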


> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		callbacks[i] = ctrlq_handle;
> +		names[i++] = "ctrlq";
> +		callbacks[i] = NULL;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		vb->ctrl_vq = vqs[i++];
> +		vb->free_page_vq = vqs[i];
> +		/* Prime the ctrlq with an inbuf for the host to send a cmd */
> +		vb->free_page_cmd_in.class =
> +					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		vb->balloon_wq = alloc_workqueue("balloon-wq",
> +					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +		vb->report_free_page_stop = true;
> +	}
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -823,6 +1031,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_CTRL_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..dbf0616 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -83,4 +84,18 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +enum {
> +	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
> +	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
> +};
> +
> +struct virtio_balloon_ctrlq_cmd {
> +	__virtio32 class;
> +	__virtio32 cmd;
> +};
> +
> +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
> +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> +
>  #endif /* _LINUX_VIRTIO_BALLOON_H */

The stop command does not appear to be thought through.

Let's assume e.g. you started migration. You ask guest for free pages.
Then you cancel it.  There are a bunch of pages in free vq and you are
getting more.  You now want to start migration again. What to do?

A bunch of vq flushing and waiting will maybe do the trick, but waiting
on guest is never a great idea.

I previously suggested pushing the stop/start commands from guest to
host on the free page vq, and including an ID in both the host-to-guest
and guest-to-host commands. That way the ctrl vq carries only
host-to-guest commands, and the host can match the IDs to know which
request a given stream of free pages is in response to.
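
Roughly like this (struct and field names invented here just for
illustration - they are not in this patch or in the virtio spec):

/* host -> guest on the ctrl vq, guest -> host on the free page vq */
struct virtio_balloon_free_page_cmd {
	__virtio32 cmd;	/* VIRTIO_BALLOON_FREE_PAGE_F_START / _F_STOP */
	__virtio32 id;	/* ties a reporting round to the request */
};

The host sends START with a fresh id; the guest tags the round on the
free page vq with that id (e.g. in a leading command buffer) and ends it
with a STOP carrying the same id. If the host cancels and restarts, it
just bumps the id and discards anything still arriving with a stale id -
no flushing or waiting on the guest required.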

I still think it's a good idea but go ahead and propose something
else that works.



> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 146+ messages in thread

* [virtio-dev] Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-10-01  3:18     ` Michael S. Tsirkin
  0 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-01  3:18 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> Add a new vq, ctrl_vq, to handle commands between the host and guest.
> With this feature, we will be able to have the control plane and data
> plane separated. In other words, the control related commands of each
> feature will be sent via the ctrl_vq, while each feature may have
> its own vq used as a data plane.
> 
> Free page report is the first new feature controlled via ctrl_vq,
> and a new cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
> Currently, this feature has two cmds:
> VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest
> to start the free page report work.
> VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is bidirectional. The guest
> would send the cmd to the host to indicate the reporting work is done.
> The host would send the cmd to the guest to actively request the stop
> of the reporting work.
> 
> The free_page_vq is used to transmit the guest free page blocks to the
> host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  drivers/virtio/virtio_balloon.c     | 249 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |  15 +++
>  2 files changed, 244 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 6952e19..70dc4ae 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
>  
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
> +			 *free_page_vq;
> +
> +	/* Balloon's own wq for cpu-intensive work items */
> +	struct workqueue_struct *balloon_wq;
> +	/* The work items submitted to the balloon wq are listed here */
> +	struct work_struct report_free_page_work;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -65,6 +71,9 @@ struct virtio_balloon {
>  	spinlock_t stop_update_lock;
>  	bool stop_update;
>  
> +	/* Stop reporting free pages */
> +	bool report_free_page_stop;
> +
>  	/* Waiting for host to ack the pages we released. */
>  	wait_queue_head_t acked;
>  
> @@ -93,6 +102,11 @@ struct virtio_balloon {
>  
>  	/* To register callback in oom notifier call chain */
>  	struct notifier_block nb;
> +
> +	/* Host to guest ctrlq cmd buf for free page report */
> +	struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
> +	/* Guest to Host ctrlq cmd buf for free page report */
> +	struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
>  };
>  
>  static struct virtio_device_id id_table[] = {
> @@ -186,6 +200,24 @@ static int send_balloon_page_sg(struct virtio_balloon *vb,
>  	return err;
>  }
>  
> +static int send_free_page_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	int ret = 0;
> +
> +	/*
> +	 * Since this is an optimization feature, losing a couplle of free

typo

> +	 * pages to report isn't important. We simply resturn without adding
> +	 * the page if the vq is full.
> +	 */
> +	if (vq->num_free) {
> +		ret = add_one_sg(vq, addr, size);
> +		if (!ret)
> +			virtqueue_kick(vq);
> +	}
> +
> +	return ret;
> +}
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -542,42 +574,210 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +					   unsigned long nr_pages)
> +{
> +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> +	void *addr = (void *)pfn_to_kaddr(pfn);
> +	uint32_t len = nr_pages << PAGE_SHIFT;
> +
> +	if (vb->report_free_page_stop)
> +		return false;
> +
> +	/* If the vq is broken, stop reporting the free pages. */
> +	if (send_free_page_sg(vb->free_page_vq, addr, len) < 0)
> +		return false;
> +
> +	return true;
> +}
> +
> +static void ctrlq_add_cmd(struct virtqueue *vq,
> +			  struct virtio_balloon_ctrlq_cmd *cmd,
> +			  bool inbuf)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct scatterlist sg;
> +	int err;
> +
> +	sg_init_one(&sg, cmd, sizeof(struct virtio_balloon_ctrlq_cmd));
> +	if (inbuf)
> +		err = virtqueue_add_inbuf(vq, &sg, 1, cmd, GFP_KERNEL);
> +	else
> +		err = virtqueue_add_outbuf(vq, &sg, 1, cmd, GFP_KERNEL);
> +
> +	/* Sanity check: this can't really happen */
> +	WARN_ON(err);
> +}
> +
> +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> +			  struct virtio_balloon_ctrlq_cmd *cmd,
> +			  bool inbuf)
> +{
> +	struct virtqueue *vq = vb->ctrl_vq;
> +
> +	ctrlq_add_cmd(vq, cmd, inbuf);
> +	if (!inbuf) {
> +		/*
> +		 * All the input cmd buffers are replenished here.
> +		 * This is necessary because the input cmd buffers are lost
> +		 * after live migration. The device needs to rewind all of
> +		 * them from the ctrl_vq.

Confused. Why would live migration lose state here, and why is
that a good idea? And how do you know this is even a migration?
All you know is that free page reporting ended, which could
happen for any reason.


> +		 */
> +		ctrlq_add_cmd(vq, &vb->free_page_cmd_in, true);
> +	}
> +	virtqueue_kick(vq);
> +}
>  
> +static void report_free_page_end(struct virtio_balloon *vb)
> +{
>  	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> +	 * The host may have already requested to stop the reporting before we
> +	 * finish, so no need to notify the host in this case.
>  	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> +	if (vb->report_free_page_stop)
> +		return;
> +
> +	vb->free_page_cmd_out.class = VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +	vb->free_page_cmd_out.cmd = VIRTIO_BALLOON_FREE_PAGE_F_STOP;
> +	ctrlq_send_cmd(vb, &vb->free_page_cmd_out, false);
> +	vb->report_free_page_stop = true;
> +}
> +
> +static void report_free_page(struct work_struct *work)
> +{
> +	struct virtio_balloon *vb;
> +
> +	vb = container_of(work, struct virtio_balloon, report_free_page_work);
> +	walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> +	report_free_page_end(vb);
> +}
> +
> +static void ctrlq_handle(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +	struct virtio_balloon_ctrlq_cmd *msg;
> +	unsigned int class, cmd, len;
> +
> +	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> +	if (unlikely(!msg))
> +		return;
> +
> +	/* The outbuf is sent by the host for recycling, so just return. */
> +	if (msg == &vb->free_page_cmd_out)
> +		return;
> +
> +	class = virtio32_to_cpu(vb->vdev, msg->class);
> +	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> +
> +	switch (class) {
> +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> +		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> +			vb->report_free_page_stop = true;
> +		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> +			vb->report_free_page_stop = false;
> +			queue_work(vb->balloon_wq, &vb->report_free_page_work);
> +		}
> +		vb->free_page_cmd_in.class =
> +					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> +	break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> +			 __func__);
> +	}

Manipulating report_free_page_stop without any locking looks
very suspicious.
Also, what if we get two start commands? We should restart
from the beginning, should we not?

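Something along these lines might close both holes (a completely
untested sketch; free_page_cmd_id is a field I am inventing here, and
I am reusing the existing stop_update_lock):

static void ctrlq_handle_free_page_cmd(struct virtio_balloon *vb, u32 cmd)
{
	unsigned long flags;

	spin_lock_irqsave(&vb->stop_update_lock, flags);
	if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
		vb->report_free_page_stop = true;
	} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
		/*
		 * Bump the generation so a walk that is still in flight
		 * notices the restart and bails out; the work item queued
		 * below then starts over from the beginning.
		 */
		vb->free_page_cmd_id++;
		vb->report_free_page_stop = false;
		queue_work(vb->balloon_wq, &vb->report_free_page_work);
	}
	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
}

virtio_balloon_send_free_pages() would then snapshot free_page_cmd_id
when the walk starts and return false as soon as it sees a different
value, so a second start command effectively restarts the report.
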
> +}
> +
> +static int init_vqs(struct virtio_balloon *vb)
> +{
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	struct scatterlist sg;
> +	int i, nvqs, err = -ENOMEM;
> +
> +	/* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	/* If ctrlq is enabled, the free page vq will also be created */
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
> +		nvqs += 2;

Since you made this generic, free page reporting should
have its own feature flag rather than relying on the ctrl vq one.

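E.g. something like this (the flag value is purely illustrative and
would of course need to be reserved in the spec):

#define VIRTIO_BALLOON_F_FREE_PAGE_VQ	5 /* VQ for reporting free pages */

	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ))
		nvqs++;
	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
		nvqs++;

so the free page vq only exists when its own feature bit has been
negotiated, independently of the ctrl vq.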

> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +
> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		callbacks[i] = ctrlq_handle;
> +		names[i++] = "ctrlq";
> +		callbacks[i] = NULL;
> +		names[i] = "free_page_vq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
> +					 NULL, NULL);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> -		struct scatterlist sg;
> -		unsigned int num_stats;
> -		vb->stats_vq = vqs[2];
> -
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
>  		 */
> -		num_stats = update_balloon_stats(vb);
> -
> -		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> +		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
>  		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> -		    < 0)
> -			BUG();
> +		    < 0) {
> +			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> +				 __func__);
> +			goto err_find;
> +		}
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		vb->ctrl_vq = vqs[i++];
> +		vb->free_page_vq = vqs[i];
> +		/* Prime the ctrlq with an inbuf for the host to send a cmd */
> +		vb->free_page_cmd_in.class =
> +					VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -706,6 +906,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
>  		xb_init(&vb->page_xb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CTRL_VQ)) {
> +		vb->balloon_wq = alloc_workqueue("balloon-wq",
> +					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> +		INIT_WORK(&vb->report_free_page_work, report_free_page);
> +		vb->report_free_page_stop = true;
> +	}
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -770,6 +977,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	spin_unlock_irq(&vb->stop_update_lock);
>  	cancel_work_sync(&vb->update_balloon_size_work);
>  	cancel_work_sync(&vb->update_balloon_stats_work);
> +	cancel_work_sync(&vb->report_free_page_work);
>  
>  	remove_common(vb);
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -823,6 +1031,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_SG,
> +	VIRTIO_BALLOON_F_CTRL_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 37780a7..dbf0616 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
> +#define VIRTIO_BALLOON_F_CTRL_VQ	4 /* Control Virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -83,4 +84,18 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +enum {
> +	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE = 0,
> +	VIRTIO_BALLOON_CTRLQ_CLASS_MAX,
> +};
> +
> +struct virtio_balloon_ctrlq_cmd {
> +	__virtio32 class;
> +	__virtio32 cmd;
> +};
> +
> +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE */
> +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> +
>  #endif /* _LINUX_VIRTIO_BALLOON_H */

The stop command does not appear to be thought through.

Let's assume e.g. you start migration and ask the guest for free
pages, then cancel it.  There are a bunch of pages in the free vq and
more are still arriving.  Now you want to start migration again. What
do you do?

A bunch of vq flushing and waiting might do the trick, but waiting
on the guest is never a great idea.

I previously suggested pushing the stop/start commands from guest to
host on the free page vq, and including an ID in both the host to guest
and the guest to host commands. This way the ctrl vq carries only host
to guest commands, and the host can match the IDs to know which request
a given batch of free pages is a response to.

I still think it's a good idea but go ahead and propose something
else that works.
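
Roughly what I have in mind, with field names that are illustrative
rather than a spec proposal:

struct virtio_balloon_free_page_cmd {
	__virtio32 cmd;	/* START or STOP */
	__virtio32 id;	/* copied from the host's request */
};

The host sends START with a fresh id; the guest echoes that id on the
free page vq ahead of its page hints and again when it sends STOP.
After a cancel the host can then simply discard anything that carries
a stale id instead of flushing the vq and waiting on the guest.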



> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 0/5] Virtio-balloon Enhancement
  2017-09-30  4:05 ` Wei Wang
  (?)
  (?)
@ 2017-10-01 13:16   ` Damian Tometzki
  -1 siblings, 0 replies; 146+ messages in thread
From: Damian Tometzki @ 2017-10-01 13:16 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

Hello,

Where can I find the patch on git.kernel.org?


Am Samstag, den 30.09.2017, 12:05 +0800 schrieb Wei Wang:
> This patch series enhances the existing virtio-balloon with the
> following
> new features:
> 1) fast ballooning: transfer ballooned pages between the guest and
> host in
> chunks using sgs, instead of one array each time; and
> 2) free page block reporting: a new virtqueue to report guest free
> pages
> to the host.
> 
> The second feature can be used to accelerate live migration of VMs.
> Here
> are some details:
> 
> Live migration needs to transfer the VM's memory from the source
> machine
> to the destination round by round. For the 1st round, all the VM's
> memory
> is transferred. From the 2nd round, only the pieces of memory that
> were
> written by the guest (after the 1st round) are transferred. One
> method
> that is popularly used by the hypervisor to track which part of
> memory is
> written is to write-protect all the guest memory.
> 
> The second feature enables the optimization of the 1st round memory
> transfer - the hypervisor can skip the transfer of guest free pages
> in the
> 1st round. It is not concerned that the memory pages are used after
> they
> are given to the hypervisor as a hint of the free pages, because they
> will
> be tracked by the hypervisor and transferred in the next round if
> they are
> used and written.
> 
> Change Log:
> v15->v16:
> 1) mm: stop reporting the free pfn range if the callback returns
> false;
> 2) mm: move some implementaion of walk_free_mem_block into a function
> to
> make the code layout looks better;
> 3) xbitmap: added some optimizations suggested by Matthew, please
> refer to
> the ChangLog in the xbitmap patch for details.
> 4) xbitmap: added a test suite
> 5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf
> returns
> an error
> 6) virtio-balloon: some small code re-arrangement, e.g. detachinf
> used buf
> from the vq before adding a new buf
> 
> v14->v15:
> 1) mm: make the report callback return a bool value - returning 1 to
> stop
> walking through the free page list.
> 2) virtio-balloon: batching sgs of balloon pages till the vq is full
> 3) virtio-balloon: create a new workqueue, rather than using the
> default
> system_wq, to queue the free page reporting work item.
> 4) virtio-balloon: add a ctrl_vq to be a central control plane which
> will
> handle all the future control related commands between the host and
> guest.
> Add free page report as the first feature controlled under ctrl_vq,
> and
> the free_page_vq is a data plane vq dedicated to the transmission of
> free
> page blocks.
> 
> v13->v14:
> 1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
> 2) xbitmap: consolidate the implementation of xb_bit_set/clear/test
> into
> one xb_bit_ops.
> 3) xbitmap: add documents for the exported APIs.
> 4) mm: rewrite the function to walk through free page blocks.
> 5) virtio-balloon: when reporting a free page blcok to the device, if
> the
> vq is full (less likey to happen in practice), just skip reporting
> this
> block, instead of busywaiting till an entry gets released.
> 6) virtio-balloon: fail the probe function if adding the signal buf
> in
> init_vqs fails.
> 
> v12->v13:
> 1) mm: use a callback function to handle the the free page blocks
> from the
> report function. This avoids exposing the zone internal to a kernel
> module.
> 2) virtio-balloon: send balloon pages or a free page block using a
> single
> sg each time. This has the benefits of simpler implementation with no
> new
> APIs.
> 3) virtio-balloon: the free_page_vq is used to report free pages only
> (no
> multiple usages interleaving)
> 4) virtio-balloon: Balloon pages and free page blocks are sent via
> input
> sgs, and the completion signal to the host is sent via an output sg.
> 
> v11->v12:
> 1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned
> pages.
> 2) virtio-ring: enable the driver to build up a desc chain using
> vring
> desc.
> 3) virtio-ring: Add locking to the existing START_USE() and END_USE()
> macro to lock/unlock the vq when a vq operation starts/ends.
> 4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
> 5) virtio-balloon: describe chunks of ballooned pages and free pages
> blocks directly using one or more chains of desc from the vq.
> 
> v10->v11:
> 1) virtio_balloon: use vring_desc to describe a chunk;
> 2) virtio_ring: support to add an indirect desc table to virtqueue;
> 3)  virtio_balloon: use cmdq to report guest memory statistics.
> 
> v9->v10:
> 1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
> 2) virtio-balloon: add virtballoon_validate();
> 3) virtio-balloon: msg format change;
> 4) virtio-balloon: move miscq handling to a task on
> system_freezable_wq;
> 5) virtio-balloon: code cleanup.
> 
> v8->v9:
> 1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
> VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
> implementation;
> 2) Simpler function to get the free page block.
> 
> v7->v8:
> 1) Use only one chunk format, instead of two.
> 2) re-write the virtio-balloon implementation patch.
> 3) commit changes
> 4) patch re-org
> 
> Matthew Wilcox (2):
>   lib/xbitmap: Introduce xbitmap
>   radix tree test suite: add tests for xbitmap
> 
> Wei Wang (3):
>   virtio-balloon: VIRTIO_BALLOON_F_SG
>   mm: support reporting free page blocks
>   virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
> 
>  drivers/virtio/virtio_balloon.c         | 437
> +++++++++++++++++++++++++++++---
>  include/linux/mm.h                      |   6 +
>  include/linux/radix-tree.h              |   2 +
>  include/linux/xbitmap.h                 |  66 +++++
>  include/uapi/linux/virtio_balloon.h     |  16 ++
>  lib/Makefile                            |   2 +-
>  lib/radix-tree.c                        |  42 ++-
>  lib/xbitmap.c                           | 264 +++++++++++++++++++
>  mm/page_alloc.c                         |  91 +++++++
>  tools/include/linux/bitmap.h            |  34 +++
>  tools/include/linux/kernel.h            |   2 +
>  tools/testing/radix-tree/Makefile       |   7 +-
>  tools/testing/radix-tree/linux/kernel.h |   2 -
>  tools/testing/radix-tree/main.c         |   5 +
>  tools/testing/radix-tree/test.h         |   1 +
>  tools/testing/radix-tree/xbitmap.c      | 269 ++++++++++++++++++++
>  16 files changed, 1203 insertions(+), 43 deletions(-)
>  create mode 100644 include/linux/xbitmap.h
>  create mode 100644 lib/xbitmap.c
>  create mode 100644 tools/testing/radix-tree/xbitmap.c
> 

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 0/5] Virtio-balloon Enhancement
  2017-09-30  4:05 ` Wei Wang
  (?)
  (?)
@ 2017-10-01 13:25   ` Damian Tometzki
  -1 siblings, 0 replies; 146+ messages in thread
From: Damian Tometzki @ 2017-10-01 13:25 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

Hello,

Where can I find the patch on git.kernel.org?

Best regards
Damian


Am Samstag, den 30.09.2017, 12:05 +0800 schrieb Wei Wang:
> This patch series enhances the existing virtio-balloon with the
> following
> new features:
> 1) fast ballooning: transfer ballooned pages between the guest and
> host in
> chunks using sgs, instead of one array each time; and
> 2) free page block reporting: a new virtqueue to report guest free
> pages
> to the host.
> 
> The second feature can be used to accelerate live migration of VMs.
> Here
> are some details:
> 
> Live migration needs to transfer the VM's memory from the source
> machine
> to the destination round by round. For the 1st round, all the VM's
> memory
> is transferred. From the 2nd round, only the pieces of memory that
> were
> written by the guest (after the 1st round) are transferred. One
> method
> that is popularly used by the hypervisor to track which part of
> memory is
> written is to write-protect all the guest memory.
> 
> The second feature enables the optimization of the 1st round memory
> transfer - the hypervisor can skip the transfer of guest free pages
> in the
> 1st round. It is not concerned that the memory pages are used after
> they
> are given to the hypervisor as a hint of the free pages, because they
> will
> be tracked by the hypervisor and transferred in the next round if
> they are
> used and written.
> 
> Change Log:
> v15->v16:
> 1) mm: stop reporting the free pfn range if the callback returns
> false;
> 2) mm: move some implementaion of walk_free_mem_block into a function
> to
> make the code layout looks better;
> 3) xbitmap: added some optimizations suggested by Matthew, please
> refer to
> the ChangLog in the xbitmap patch for details.
> 4) xbitmap: added a test suite
> 5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf
> returns
> an error
> 6) virtio-balloon: some small code re-arrangement, e.g. detachinf
> used buf
> from the vq before adding a new buf
> 
> v14->v15:
> 1) mm: make the report callback return a bool value - returning 1 to
> stop
> walking through the free page list.
> 2) virtio-balloon: batching sgs of balloon pages till the vq is full
> 3) virtio-balloon: create a new workqueue, rather than using the
> default
> system_wq, to queue the free page reporting work item.
> 4) virtio-balloon: add a ctrl_vq to be a central control plane which
> will
> handle all the future control related commands between the host and
> guest.
> Add free page report as the first feature controlled under ctrl_vq,
> and
> the free_page_vq is a data plane vq dedicated to the transmission of
> free
> page blocks.
> 
> v13->v14:
> 1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
> 2) xbitmap: consolidate the implementation of xb_bit_set/clear/test
> into
> one xb_bit_ops.
> 3) xbitmap: add documents for the exported APIs.
> 4) mm: rewrite the function to walk through free page blocks.
> 5) virtio-balloon: when reporting a free page blcok to the device, if
> the
> vq is full (less likey to happen in practice), just skip reporting
> this
> block, instead of busywaiting till an entry gets released.
> 6) virtio-balloon: fail the probe function if adding the signal buf
> in
> init_vqs fails.
> 
> v12->v13:
> 1) mm: use a callback function to handle the the free page blocks
> from the
> report function. This avoids exposing the zone internal to a kernel
> module.
> 2) virtio-balloon: send balloon pages or a free page block using a
> single
> sg each time. This has the benefits of simpler implementation with no
> new
> APIs.
> 3) virtio-balloon: the free_page_vq is used to report free pages only
> (no
> multiple usages interleaving)
> 4) virtio-balloon: Balloon pages and free page blocks are sent via
> input
> sgs, and the completion signal to the host is sent via an output sg.
> 
> v11->v12:
> 1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned
> pages.
> 2) virtio-ring: enable the driver to build up a desc chain using
> vring
> desc.
> 3) virtio-ring: Add locking to the existing START_USE() and END_USE()
> macro to lock/unlock the vq when a vq operation starts/ends.
> 4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
> 5) virtio-balloon: describe chunks of ballooned pages and free pages
> blocks directly using one or more chains of desc from the vq.
> 
> v10->v11:
> 1) virtio_balloon: use vring_desc to describe a chunk;
> 2) virtio_ring: support to add an indirect desc table to virtqueue;
> 3)  virtio_balloon: use cmdq to report guest memory statistics.
> 
> v9->v10:
> 1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
> 2) virtio-balloon: add virtballoon_validate();
> 3) virtio-balloon: msg format change;
> 4) virtio-balloon: move miscq handling to a task on
> system_freezable_wq;
> 5) virtio-balloon: code cleanup.
> 
> v8->v9:
> 1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
> VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
> implementation;
> 2) Simpler function to get the free page block.
> 
> v7->v8:
> 1) Use only one chunk format, instead of two.
> 2) re-write the virtio-balloon implementation patch.
> 3) commit changes
> 4) patch re-org
> 
> Matthew Wilcox (2):
>   lib/xbitmap: Introduce xbitmap
>   radix tree test suite: add tests for xbitmap
> 
> Wei Wang (3):
>   virtio-balloon: VIRTIO_BALLOON_F_SG
>   mm: support reporting free page blocks
>   virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
> 
>  drivers/virtio/virtio_balloon.c         | 437
> +++++++++++++++++++++++++++++---
>  include/linux/mm.h                      |   6 +
>  include/linux/radix-tree.h              |   2 +
>  include/linux/xbitmap.h                 |  66 +++++
>  include/uapi/linux/virtio_balloon.h     |  16 ++
>  lib/Makefile                            |   2 +-
>  lib/radix-tree.c                        |  42 ++-
>  lib/xbitmap.c                           | 264 +++++++++++++++++++
>  mm/page_alloc.c                         |  91 +++++++
>  tools/include/linux/bitmap.h            |  34 +++
>  tools/include/linux/kernel.h            |   2 +
>  tools/testing/radix-tree/Makefile       |   7 +-
>  tools/testing/radix-tree/linux/kernel.h |   2 -
>  tools/testing/radix-tree/main.c         |   5 +
>  tools/testing/radix-tree/test.h         |   1 +
>  tools/testing/radix-tree/xbitmap.c      | 269 ++++++++++++++++++++
>  16 files changed, 1203 insertions(+), 43 deletions(-)
>  create mode 100644 include/linux/xbitmap.h
>  create mode 100644 lib/xbitmap.c
>  create mode 100644 tools/testing/radix-tree/xbitmap.c
> 

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 0/5] Virtio-balloon Enhancement
@ 2017-10-01 13:25   ` Damian Tometzki
  0 siblings, 0 replies; 146+ messages in thread
From: Damian Tometzki @ 2017-10-01 13:25 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

Hello,

where i can found the patch in git.kernel.org ?

Best regards
Damian


Am Samstag, den 30.09.2017, 12:05 +0800 schrieb Wei Wang:
> This patch series enhances the existing virtio-balloon with the
> following
> new features:
> 1) fast ballooning: transfer ballooned pages between the guest and
> host in
> chunks using sgs, instead of one array each time; and
> 2) free page block reporting: a new virtqueue to report guest free
> pages
> to the host.
> 
> The second feature can be used to accelerate live migration of VMs.
> Here
> are some details:
> 
> Live migration needs to transfer the VM's memory from the source
> machine
> to the destination round by round. For the 1st round, all the VM's
> memory
> is transferred. From the 2nd round, only the pieces of memory that
> were
> written by the guest (after the 1st round) are transferred. One
> method
> that is popularly used by the hypervisor to track which part of
> memory is
> written is to write-protect all the guest memory.
> 
> The second feature enables the optimization of the 1st round memory
> transfer - the hypervisor can skip the transfer of guest free pages
> in the
> 1st round. It is not concerned that the memory pages are used after
> they
> are given to the hypervisor as a hint of the free pages, because they
> will
> be tracked by the hypervisor and transferred in the next round if
> they are
> used and written.
> 
> Change Log:
> v15->v16:
> 1) mm: stop reporting the free pfn range if the callback returns
> false;
> 2) mm: move some implementaion of walk_free_mem_block into a function
> to
> make the code layout looks better;
> 3) xbitmap: added some optimizations suggested by Matthew, please
> refer to
> the ChangLog in the xbitmap patch for details.
> 4) xbitmap: added a test suite
> 5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf
> returns
> an error
> 6) virtio-balloon: some small code re-arrangement, e.g. detachinf
> used buf
> from the vq before adding a new buf
> 
> v14->v15:
> 1) mm: make the report callback return a bool value - returning 1 to
> stop
> walking through the free page list.
> 2) virtio-balloon: batching sgs of balloon pages till the vq is full
> 3) virtio-balloon: create a new workqueue, rather than using the
> default
> system_wq, to queue the free page reporting work item.
> 4) virtio-balloon: add a ctrl_vq to be a central control plane which
> will
> handle all the future control related commands between the host and
> guest.
> Add free page report as the first feature controlled under ctrl_vq,
> and
> the free_page_vq is a data plane vq dedicated to the transmission of
> free
> page blocks.
> 
> v13->v14:
> 1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
> 2) xbitmap: consolidate the implementation of xb_bit_set/clear/test
> into
> one xb_bit_ops.
> 3) xbitmap: add documents for the exported APIs.
> 4) mm: rewrite the function to walk through free page blocks.
> 5) virtio-balloon: when reporting a free page blcok to the device, if
> the
> vq is full (less likey to happen in practice), just skip reporting
> this
> block, instead of busywaiting till an entry gets released.
> 6) virtio-balloon: fail the probe function if adding the signal buf
> in
> init_vqs fails.
> 
> v12->v13:
> 1) mm: use a callback function to handle the free page blocks
> from the
> report function. This avoids exposing the zone internal to a kernel
> module.
> 2) virtio-balloon: send balloon pages or a free page block using a
> single
> sg each time. This has the benefits of simpler implementation with no
> new
> APIs.
> 3) virtio-balloon: the free_page_vq is used to report free pages only
> (no
> multiple usages interleaving)
> 4) virtio-balloon: Balloon pages and free page blocks are sent via
> input
> sgs, and the completion signal to the host is sent via an output sg.
> 
> v11->v12:
> 1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned
> pages.
> 2) virtio-ring: enable the driver to build up a desc chain using
> vring
> desc.
> 3) virtio-ring: Add locking to the existing START_USE() and END_USE()
> macro to lock/unlock the vq when a vq operation starts/ends.
> 4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
> 5) virtio-balloon: describe chunks of ballooned pages and free pages
> blocks directly using one or more chains of desc from the vq.
> 
> v10->v11:
> 1) virtio_balloon: use vring_desc to describe a chunk;
> 2) virtio_ring: support to add an indirect desc table to virtqueue;
> 3)  virtio_balloon: use cmdq to report guest memory statistics.
> 
> v9->v10:
> 1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
> 2) virtio-balloon: add virtballoon_validate();
> 3) virtio-balloon: msg format change;
> 4) virtio-balloon: move miscq handling to a task on
> system_freezable_wq;
> 5) virtio-balloon: code cleanup.
> 
> v8->v9:
> 1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
> VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
> implementation;
> 2) Simpler function to get the free page block.
> 
> v7->v8:
> 1) Use only one chunk format, instead of two.
> 2) re-write the virtio-balloon implementation patch.
> 3) commit changes
> 4) patch re-org
> 
> Matthew Wilcox (2):
>   lib/xbitmap: Introduce xbitmap
>   radix tree test suite: add tests for xbitmap
> 
> Wei Wang (3):
>   virtio-balloon: VIRTIO_BALLOON_F_SG
>   mm: support reporting free page blocks
>   virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
> 
>  drivers/virtio/virtio_balloon.c         | 437
> +++++++++++++++++++++++++++++---
>  include/linux/mm.h                      |   6 +
>  include/linux/radix-tree.h              |   2 +
>  include/linux/xbitmap.h                 |  66 +++++
>  include/uapi/linux/virtio_balloon.h     |  16 ++
>  lib/Makefile                            |   2 +-
>  lib/radix-tree.c                        |  42 ++-
>  lib/xbitmap.c                           | 264 +++++++++++++++++++
>  mm/page_alloc.c                         |  91 +++++++
>  tools/include/linux/bitmap.h            |  34 +++
>  tools/include/linux/kernel.h            |   2 +
>  tools/testing/radix-tree/Makefile       |   7 +-
>  tools/testing/radix-tree/linux/kernel.h |   2 -
>  tools/testing/radix-tree/main.c         |   5 +
>  tools/testing/radix-tree/test.h         |   1 +
>  tools/testing/radix-tree/xbitmap.c      | 269 ++++++++++++++++++++
>  16 files changed, 1203 insertions(+), 43 deletions(-)
>  create mode 100644 include/linux/xbitmap.h
>  create mode 100644 lib/xbitmap.c
>  create mode 100644 tools/testing/radix-tree/xbitmap.c
> 

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-30  4:05   ` Wei Wang
  (?)
@ 2017-10-02  4:30     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-02  4:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

Looks good to me. Minor comments below.

On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> @@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> +	unsigned int len;
> +
> +	virtqueue_kick(vq);
> +	wait_event(wq_head, virtqueue_get_buf(vq, &len));
> +}
> +
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +	unsigned int len;
> +
> +	sg_init_one(&sg, addr, size);
> +
> +	/* Detach all the used buffers from the vq */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static int send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size,
> +				 bool batch)
> +{
> +	int err;
> +
> +	err = add_one_sg(vq, addr, size);
> +
> +	/* If batchng is requested, we batch till the vq is full */

typo

> +	if (!batch || !vq->num_free)
> +		kick_and_wait(vq, vb->acked);
> +
> +	return err;
> +}

If add_one_sg fails, kick_and_wait will hang forever.

The reason this might work is:
1. with 1 sg there are no memory allocations
2. if adding fails on vq full, then something
   is in queue and will wake up kick_and_wait.

So in short this is expected to never fail.
How about a BUG_ON here then?
And make it void, and add a comment with above explanation.
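
For illustration, a minimal (untested) sketch of that shape, reusing the
helpers from this patch, could be:

static void send_balloon_page_sg(struct virtio_balloon *vb,
				 struct virtqueue *vq,
				 void *addr,
				 uint32_t size,
				 bool batch)
{
	/*
	 * add_one_sg() is not expected to fail: a single sg needs no
	 * memory allocation, and if the vq is full a previously queued
	 * buffer will wake kick_and_wait().
	 */
	int err = add_one_sg(vq, addr, size);

	BUG_ON(err);

	/* If batching is requested, we batch till the vq is full. */
	if (!batch || !vq->num_free)
		kick_and_wait(vq, vb->acked);
}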

> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +	int err = 0;
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
> +						    page_xb_end);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
> +						   sg_pfn_start + 1,
> +						   page_xb_end);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
> +						   true);
> +			if (unlikely(err < 0))
> +				goto err_out;
> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
> +		if (unlikely(err < 0))
> +			goto err_out;
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +
> +	/*
> +	 * The last few sgs may not reach the batch size, but need a kick to
> +	 * notify the device to handle them.
> +	 */
> +	if (vq->num_free != virtqueue_get_vring_size(vq))
> +		kick_and_wait(vq, vb->acked);
> +
> +	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
> +	return;
> +
> +err_out:
> +	dev_warn(&vb->vdev->dev, "%s failure: %d\n", __func__, err);

so fundamentally just make send_balloon_page_sg void then.
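
The splitting loop in tell_host_sgs() would then drop its error path
entirely; roughly (excerpt sketch only, untested):

		while (sg_len > sg_max_len) {
			send_balloon_page_sg(vb, vq, sg_addr, sg_max_len, true);
			sg_addr += sg_max_len;
			sg_len -= sg_max_len;
		}
		send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
		sg_pfn_start = sg_pfn_end + 1;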

> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +282,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +296,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +342,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +357,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +581,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +605,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +627,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE, false);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE, false);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +703,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +822,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-30  4:05   ` Wei Wang
                     ` (2 preceding siblings ...)
  (?)
@ 2017-10-02  4:30   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-02  4:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, mawilcox, qemu-devel, amit.shah,
	liliang.opensource, linux-kernel, willy, virtualization,
	linux-mm, yang.zhang.wz, quan.xu, cornelia.huck, pbonzini, akpm,
	mhocko, mgorman

Looks good to me. Minor comments below.

On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> @@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> +	unsigned int len;
> +
> +	virtqueue_kick(vq);
> +	wait_event(wq_head, virtqueue_get_buf(vq, &len));
> +}
> +
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> +	struct scatterlist sg;
> +	unsigned int len;
> +
> +	sg_init_one(&sg, addr, size);
> +
> +	/* Detach all the used buffers from the vq */
> +	while (virtqueue_get_buf(vq, &len))
> +		;
> +
> +	return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static int send_balloon_page_sg(struct virtio_balloon *vb,
> +				 struct virtqueue *vq,
> +				 void *addr,
> +				 uint32_t size,
> +				 bool batch)
> +{
> +	int err;
> +
> +	err = add_one_sg(vq, addr, size);
> +
> +	/* If batchng is requested, we batch till the vq is full */

typo

> +	if (!batch || !vq->num_free)
> +		kick_and_wait(vq, vb->acked);
> +
> +	return err;
> +}

If add_one_sg fails, kick_and_wait will hang forever.

The reason this might work is because
1. with 1 sg there are no memory allocations
2. if adding fails on vq full, then something
   is in queue and will wake up kick_and_wait.

So in short this is expected to never fail.
How about a BUG_ON here then?
And make it void, and add a comment with above explanation.
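Roughly like this (just a sketch, not tested):

static void send_balloon_page_sg(struct virtio_balloon *vb,
				 struct virtqueue *vq,
				 void *addr,
				 uint32_t size,
				 bool batch)
{
	int err = add_one_sg(vq, addr, size);

	/* Expected to never fail - see the explanation above. */
	BUG_ON(err);

	/* If batching is requested, we batch till the vq is full. */
	if (!batch || !vq->num_free)
		kick_and_wait(vq, vb->acked);
}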

> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +			  struct virtqueue *vq,
> +			  unsigned long page_xb_start,
> +			  unsigned long page_xb_end)
> +{
> +	unsigned long sg_pfn_start, sg_pfn_end;
> +	void *sg_addr;
> +	uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> +	int err = 0;
> +
> +	sg_pfn_start = page_xb_start;
> +	while (sg_pfn_start < page_xb_end) {
> +		sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
> +						    page_xb_end);
> +		if (sg_pfn_start == page_xb_end + 1)
> +			break;
> +		sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
> +						   sg_pfn_start + 1,
> +						   page_xb_end);
> +		sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> +		sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> +		while (sg_len > sg_max_len) {
> +			err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
> +						   true);
> +			if (unlikely(err < 0))
> +				goto err_out;
> +			sg_addr += sg_max_len;
> +			sg_len -= sg_max_len;
> +		}
> +		err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
> +		if (unlikely(err < 0))
> +			goto err_out;
> +		sg_pfn_start = sg_pfn_end + 1;
> +	}
> +
> +	/*
> +	 * The last few sgs may not reach the batch size, but need a kick to
> +	 * notify the device to handle them.
> +	 */
> +	if (vq->num_free != virtqueue_get_vring_size(vq))
> +		kick_and_wait(vq, vb->acked);
> +
> +	xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
> +	return;
> +
> +err_out:
> +	dev_warn(&vb->vdev->dev, "%s failure: %d\n", __func__, err);

so fundamentally just make send_balloon_page_sg void then.

> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +282,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +296,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +342,11 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (use_sg)
> +			xb_set_page(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +357,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns) {
> +		if (use_sg)
> +			tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -441,6 +581,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -464,6 +605,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>  	unsigned long flags;
>  
>  	/*
> @@ -485,16 +627,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->inflate_vq, page_address(newpage),
> +				     PAGE_SIZE, false);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (use_sg) {
> +		send_balloon_page_sg(vb, vb->deflate_vq, page_address(page),
> +				     PAGE_SIZE, false);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -553,6 +703,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vb;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
> +		xb_init(&vb->page_xb);
> +
>  	vb->nb.notifier_call = virtballoon_oom_notify;
>  	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
>  	err = register_oom_notifier(&vb->nb);
> @@ -669,6 +822,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_SG,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..37780a7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_SG		3 /* Use sg instead of PFN lists */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 146+ messages in thread

* RE: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-02  4:30     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-10-02 12:39       ` Wang, Wei W
  -1 siblings, 0 replies; 146+ messages in thread
From: Wang, Wei W @ 2017-10-02 12:39 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Monday, October 2, 2017 12:30 PM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> > +static int send_balloon_page_sg(struct virtio_balloon *vb,
> > +				 struct virtqueue *vq,
> > +				 void *addr,
> > +				 uint32_t size,
> > +				 bool batch)
> > +{
> > +	int err;
> > +
> > +	err = add_one_sg(vq, addr, size);
> > +
> > +	/* If batchng is requested, we batch till the vq is full */
> 
> typo
> 
> > +	if (!batch || !vq->num_free)
> > +		kick_and_wait(vq, vb->acked);
> > +
> > +	return err;
> > +}
> 
> If add_one_sg fails, kick_and_wait will hang forever.
> 
> The reason this might work is because
> 1. with 1 sg there are no memory allocations
> 2. if adding fails on vq full, then something
>    is in queue and will wake up kick_and_wait.
> 
> So in short this is expected to never fail.
> How about a BUG_ON here then?
> And make it void, and add a comment with above explanation.
> 


Yes, I agree that this wouldn't fail - the worker thread performing the ballooning operations is put to sleep when the vq is full, so no one else should be able to put more sgs onto the vq at that point.
Btw, I'm not sure we need to mention memory allocation in the comment - I found that virtqueue_add() doesn't return an error when the allocation (for indirect descriptors) fails; it simply falls back to not using indirect descriptors.

What do you think of the following? 

err = add_one_sg(vq, addr, size);
/* 
  * This is expected to never fail: there is always at least 1 entry available on the vq,
  * because when the vq is full the worker thread that adds the sg will be put into
  * sleep until at least 1 entry is available to use.
  */
BUG_ON(err);

Best,
Wei



 

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-02 12:39       ` Wang, Wei W
  (?)
  (?)
@ 2017-10-02 13:44         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-02 13:44 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon, Oct 02, 2017 at 12:39:30PM +0000, Wang, Wei W wrote:
> On Monday, October 2, 2017 12:30 PM, Michael S. Tsirkin wrote:
> > On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> > > +static int send_balloon_page_sg(struct virtio_balloon *vb,
> > > +				 struct virtqueue *vq,
> > > +				 void *addr,
> > > +				 uint32_t size,
> > > +				 bool batch)
> > > +{
> > > +	int err;
> > > +
> > > +	err = add_one_sg(vq, addr, size);
> > > +
> > > +	/* If batchng is requested, we batch till the vq is full */
> > 
> > typo
> > 
> > > +	if (!batch || !vq->num_free)
> > > +		kick_and_wait(vq, vb->acked);
> > > +
> > > +	return err;
> > > +}
> > 
> > If add_one_sg fails, kick_and_wait will hang forever.
> > 
> > The reason this might work is because
> > 1. with 1 sg there are no memory allocations
> > 2. if adding fails on vq full, then something
> >    is in queue and will wake up kick_and_wait.
> > 
> > So in short this is expected to never fail.
> > How about a BUG_ON here then?
> > And make it void, and add a comment with above explanation.
> > 
> 
> 
> Yes, I agree that this wouldn't fail - the worker thread performing the ballooning operations is put to sleep when the vq is full, so no one else should be able to put more sgs onto the vq at that point.
> Btw, I'm not sure we need to mention memory allocation in the comment - I found that virtqueue_add() doesn't return an error when the allocation (for indirect descriptors) fails; it simply falls back to not using indirect descriptors.
> 
> What do you think of the following? 
> 
> err = add_one_sg(vq, addr, size);
> /* 
>   * This is expected to never fail: there is always at least 1 entry available on the vq,
>   * because when the vq is full the worker thread that adds the sg will be put into
>   * sleep until at least 1 entry is available to use.
>   */
> BUG_ON(err);
> 
> Best,
> Wei
> 
> 
> 
>  

Sounds good.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* RE: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-10-01  3:18     ` Michael S. Tsirkin
                         ` (2 preceding siblings ...)
  (?)
@ 2017-10-02 16:38       ` Wang, Wei W
  -1 siblings, 0 replies; 146+ messages in thread
From: Wang, Wei W @ 2017-10-02 16:38 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > +			  struct virtio_balloon_ctrlq_cmd *cmd,
> > +			  bool inbuf)
> > +{
> > +	struct virtqueue *vq = vb->ctrl_vq;
> > +
> > +	ctrlq_add_cmd(vq, cmd, inbuf);
> > +	if (!inbuf) {
> > +		/*
> > +		 * All the input cmd buffers are replenished here.
> > +		 * This is necessary because the input cmd buffers are lost
> > +		 * after live migration. The device needs to rewind all of
> > +		 * them from the ctrl_vq.
> 
> Confused. Live migration somehow loses state? Why is that and why is it a good
> idea? And how do you know this is migration even?
> Looks like all you know is you got free page end. Could be any reason for this.


I think this is something that the current live migration lacks - what the device has
read from the vq is not transferred during live migration. An example is the
stat_vq_elem:
Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c

Anything that is added to the vq and needs to be held by the device for later use
has to account for the fact that live migration may happen at any time, and needs
to be re-taken from the vq by the device on the destination machine.

So, even without this live migration optimization feature, I think everything that is
added to the vq for the device to hold needs a way to be rewound from the vq by
the device - re-adding all the elements to the vq is a trick to keep a record of them
on the vq so that the device-side rewinding can work.

Please let me know if I have missed anything, or if you have other suggestions.


> > +static void ctrlq_handle(struct virtqueue *vq) {
> > +	struct virtio_balloon *vb = vq->vdev->priv;
> > +	struct virtio_balloon_ctrlq_cmd *msg;
> > +	unsigned int class, cmd, len;
> > +
> > +	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> > +	if (unlikely(!msg))
> > +		return;
> > +
> > +	/* The outbuf is sent by the host for recycling, so just return. */
> > +	if (msg == &vb->free_page_cmd_out)
> > +		return;
> > +
> > +	class = virtio32_to_cpu(vb->vdev, msg->class);
> > +	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> > +
> > +	switch (class) {
> > +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> > +		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> > +			vb->report_free_page_stop = true;
> > +		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> > +			vb->report_free_page_stop = false;
> > +			queue_work(vb->balloon_wq, &vb-
> >report_free_page_work);
> > +		}
> > +		vb->free_page_cmd_in.class =
> > +
> 	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> > +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> > +	break;
> > +	default:
> > +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> > +			 __func__);
> > +	}
> 
> Manipulating report_free_page_stop without any locks looks very suspicious.

> Also, what if we get two start commands? we should restart from beginning,
> should we not?
> 


Yes, it will start reporting free pages from the beginning.
walk_free_mem_block() doesn't maintain any internal state, so each invocation
always starts from the beginning.


> > +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE
> */
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> > +
> >  #endif /* _LINUX_VIRTIO_BALLOON_H */
> 
> The stop command does not appear to be thought through.
> 
> Let's assume e.g. you started migration. You ask guest for free pages.
> Then you cancel it.  There are a bunch of pages in free vq and you are getting
> more.  You now want to start migration again. What to do?
> 
> A bunch of vq flushing and waiting will maybe do the trick, but waiting on guest
> is never a great idea.
> 


I think the device can flush the vq (pop out what's left in it and push the entries
back) right after the Stop command is sent to the guest, rather than doing the flush
when the second attempt at live migration begins. The entries pushed back onto the
vq will be in the used ring, so what would the device need to wait for?
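For example, the device side could do something like this right after sending Stop
(a rough sketch only, using the generic QEMU virtqueue helpers - this is not code
from this series):

static void free_page_vq_flush(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtQueueElement *elem;

    /* Drop whatever stale free page hints are still pending on the vq. */
    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
        virtqueue_push(vq, elem, 0);
        g_free(elem);
    }
    virtio_notify(vdev, vq);
}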


> I previously suggested pushing the stop/start commands from guest to host on
> the free page vq, and including an ID in host to guest and guest to host
> commands. This way ctrl vq is just for host to guest commands, and host
> matches commands and knows which command is a free page in response to.
> 
> I still think it's a good idea but go ahead and propose something else that works.
> 

Thanks for the suggestion. Perhaps I haven't fully understood it - please see the
example below:

1) host-to-guest ctrl_vq:
StartCMD, ID=1

2) guest-to-host free_page_vq:
free_page, ID=1
free_page, ID=1
free_page, ID=1
free_page, ID=1

3) host-to-guest ctrl_vq:
StopCMD, ID=1

4) initiate the 2nd try of live migration via host-to-guest ctrl_vq:
StartCMD, ID=2

5) the guest-to-host free_page_vq might look like this:
free_page, ID=1
free_page, ID=1
free_page, ID=2
free_page, ID=2

The device will need to drop (pop out the two entries and push them back) the
first two obsolete free pages, which were sent with ID=1.
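For reference, with the ID-based matching each guest-to-host free page report would
have to carry the ID of the command it answers. A rough sketch of such an element
(the struct and field names are just my assumption, not anything from the patch):

struct virtio_balloon_free_page_elem {
	__virtio64 addr;	/* guest address of the free page block */
	__virtio32 len;		/* length of the block in bytes */
	__virtio32 id;		/* ID of the StartCMD this report answers */
};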

I don't yet see the benefit of the approach above. The device still performs the same
operations to get rid of the old free pages. If we drop the old free pages right after
the StopCMD (the ID may also not be needed in that case), the overhead isn't added
to the live migration time.

Do you have any thoughts on this?


Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* RE: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-10-02 16:38       ` Wang, Wei W
  0 siblings, 0 replies; 146+ messages in thread
From: Wang, Wei W @ 2017-10-02 16:38 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy,
	liliang.opensource@gmail.com

On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > +			  struct virtio_balloon_ctrlq_cmd *cmd,
> > +			  bool inbuf)
> > +{
> > +	struct virtqueue *vq = vb->ctrl_vq;
> > +
> > +	ctrlq_add_cmd(vq, cmd, inbuf);
> > +	if (!inbuf) {
> > +		/*
> > +		 * All the input cmd buffers are replenished here.
> > +		 * This is necessary because the input cmd buffers are lost
> > +		 * after live migration. The device needs to rewind all of
> > +		 * them from the ctrl_vq.
> 
> Confused. Live migration somehow loses state? Why is that and why is it a good
> idea? And how do you know this is migration even?
> Looks like all you know is you got free page end. Could be any reason for this.


I think this would be something that the current live migration lacks - what the
device read from the vq is not transferred during live migration, an example is the 
stat_vq_elem: 
Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c

For all the things that are added to the vq and need to be held by the device
to use later need to consider the situation that live migration might happen at any
time and they need to be re-taken from the vq by the device on the destination
machine.

So, even without this live migration optimization feature, I think all the things that are 
added to the vq for the device to hold, need a way for the device to rewind back from
the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
on the vq so that the device side rewinding can work. 

Please let me know if anything is missed or if you have other suggestions.


> > +static void ctrlq_handle(struct virtqueue *vq) {
> > +	struct virtio_balloon *vb = vq->vdev->priv;
> > +	struct virtio_balloon_ctrlq_cmd *msg;
> > +	unsigned int class, cmd, len;
> > +
> > +	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> > +	if (unlikely(!msg))
> > +		return;
> > +
> > +	/* The outbuf is sent by the host for recycling, so just return. */
> > +	if (msg == &vb->free_page_cmd_out)
> > +		return;
> > +
> > +	class = virtio32_to_cpu(vb->vdev, msg->class);
> > +	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> > +
> > +	switch (class) {
> > +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> > +		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> > +			vb->report_free_page_stop = true;
> > +		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> > +			vb->report_free_page_stop = false;
> > +			queue_work(vb->balloon_wq, &vb-
> >report_free_page_work);
> > +		}
> > +		vb->free_page_cmd_in.class =
> > +
> 	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> > +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> > +	break;
> > +	default:
> > +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> > +			 __func__);
> > +	}
> 
> Manipulating report_free_page_stop without any locks looks very suspicious.

> Also, what if we get two start commands? we should restart from beginning,
> should we not?
> 


Yes, it will start to report free pages from the beginning.
walk_free_mem_block() doesn't maintain any internal status, so the invoking of
it will always start from the beginning.


> > +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE
> */
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> > +
> >  #endif /* _LINUX_VIRTIO_BALLOON_H */
> 
> The stop command does not appear to be thought through.
> 
> Let's assume e.g. you started migration. You ask guest for free pages.
> Then you cancel it.  There are a bunch of pages in free vq and you are getting
> more.  You now want to start migration again. What to do?
> 
> A bunch of vq flushing and waiting will maybe do the trick, but waiting on guest
> is never a great idea.
> 


I think the device can flush (pop out what's left in the vq and push them back) the
vq right after the Stop command is sent to the guest, rather than doing the flush
when the 2nd initiation of live migration begins. The entries pushed back to the vq
will be in the used ring, what would the device need to wait for?


> I previously suggested pushing the stop/start commands from guest to host on
> the free page vq, and including an ID in host to guest and guest to host
> commands. This way ctrl vq is just for host to guest commands, and host
> matches commands and knows which command is a free page in response to.
> 
> I still think it's a good idea but go ahead and propose something else that works.
> 

Thanks for the suggestion. Probably I haven't fully understood it. Please see the example
below:

1) host-to-guest ctrl_vq:
StartCMD, ID=1

2) guest-to-host free_page_vq:
free_page, ID=1
free_page, ID=1
free_page, ID=1
free_page, ID=1

3) host-to-guest ctrl_vq:
StopCMD, ID=1

4) initiate the 2nd try of live migration via host-to-guest ctrl_vq:
StartCMD, ID=2

5) the guest-to-host free_page_vq might look like this:
free_page, ID=1
free_page, ID=1
free_page, ID=2
free_page, ID=2

The device will need to drop (pop out the two entries and push them back)
the first 2 obsolete free pages which are sent by ID=1.

I haven't found the benefits above yet. The device will perform the same operations
to get rid of the old free pages. If we drop the old free pages after the StopCMD (
ID may also not be needed in this case), the overhead won't be added to the live
migration time.

Would you have any thought about this?


Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* RE: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-10-02 16:38       ` Wang, Wei W
  0 siblings, 0 replies; 146+ messages in thread
From: Wang, Wei W @ 2017-10-02 16:38 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > +			  struct virtio_balloon_ctrlq_cmd *cmd,
> > +			  bool inbuf)
> > +{
> > +	struct virtqueue *vq = vb->ctrl_vq;
> > +
> > +	ctrlq_add_cmd(vq, cmd, inbuf);
> > +	if (!inbuf) {
> > +		/*
> > +		 * All the input cmd buffers are replenished here.
> > +		 * This is necessary because the input cmd buffers are lost
> > +		 * after live migration. The device needs to rewind all of
> > +		 * them from the ctrl_vq.
> 
> Confused. Live migration somehow loses state? Why is that and why is it a good
> idea? And how do you know this is migration even?
> Looks like all you know is you got free page end. Could be any reason for this.


I think this would be something that the current live migration lacks - what the
device read from the vq is not transferred during live migration, an example is the 
stat_vq_elem: 
Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c

For all the things that are added to the vq and need to be held by the device
to use later need to consider the situation that live migration might happen at any
time and they need to be re-taken from the vq by the device on the destination
machine.

So, even without this live migration optimization feature, I think all the things that are 
added to the vq for the device to hold, need a way for the device to rewind back from
the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
on the vq so that the device side rewinding can work. 

Please let me know if anything is missed or if you have other suggestions.


> > +static void ctrlq_handle(struct virtqueue *vq) {
> > +	struct virtio_balloon *vb = vq->vdev->priv;
> > +	struct virtio_balloon_ctrlq_cmd *msg;
> > +	unsigned int class, cmd, len;
> > +
> > +	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> > +	if (unlikely(!msg))
> > +		return;
> > +
> > +	/* The outbuf is sent by the host for recycling, so just return. */
> > +	if (msg == &vb->free_page_cmd_out)
> > +		return;
> > +
> > +	class = virtio32_to_cpu(vb->vdev, msg->class);
> > +	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> > +
> > +	switch (class) {
> > +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> > +		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> > +			vb->report_free_page_stop = true;
> > +		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> > +			vb->report_free_page_stop = false;
> > +			queue_work(vb->balloon_wq, &vb-
> >report_free_page_work);
> > +		}
> > +		vb->free_page_cmd_in.class =
> > +
> 	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> > +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> > +	break;
> > +	default:
> > +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> > +			 __func__);
> > +	}
> 
> Manipulating report_free_page_stop without any locks looks very suspicious.

> Also, what if we get two start commands? we should restart from beginning,
> should we not?
> 


Yes, it will start to report free pages from the beginning.
walk_free_mem_block() doesn't maintain any internal status, so the invoking of
it will always start from the beginning.


> > +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE
> */
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> > +
> >  #endif /* _LINUX_VIRTIO_BALLOON_H */
> 
> The stop command does not appear to be thought through.
> 
> Let's assume e.g. you started migration. You ask guest for free pages.
> Then you cancel it.  There are a bunch of pages in free vq and you are getting
> more.  You now want to start migration again. What to do?
> 
> A bunch of vq flushing and waiting will maybe do the trick, but waiting on guest
> is never a great idea.
> 


I think the device can flush (pop out what's left in the vq and push them back) the
vq right after the Stop command is sent to the guest, rather than doing the flush
when the 2nd initiation of live migration begins. The entries pushed back to the vq
will be in the used ring, what would the device need to wait for?


> I previously suggested pushing the stop/start commands from guest to host on
> the free page vq, and including an ID in host to guest and guest to host
> commands. This way ctrl vq is just for host to guest commands, and host
> matches commands and knows which command is a free page in response to.
> 
> I still think it's a good idea but go ahead and propose something else that works.
> 

Thanks for the suggestion. Probably I haven't fully understood it. Please see the example
below:

1) host-to-guest ctrl_vq:
StartCMD, ID=1

2) guest-to-host free_page_vq:
free_page, ID=1
free_page, ID=1
free_page, ID=1
free_page, ID=1

3) host-to-guest ctrl_vq:
StopCMD, ID=1

4) initiate the 2nd try of live migration via host-to-guest ctrl_vq:
StartCMD, ID=2

5) the guest-to-host free_page_vq might look like this:
free_page, ID=1
free_page, ID=1
free_page, ID=2
free_page, ID=2

The device will need to drop (pop out the two entries and push them back)
the first 2 obsolete free pages which are sent by ID=1.

I haven't found the benefits above yet. The device will perform the same operations
to get rid of the old free pages. If we drop the old free pages after the StopCMD (
ID may also not be needed in this case), the overhead won't be added to the live
migration time.

Would you have any thought about this?


Best,
Wei


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [Qemu-devel] [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-10-02 16:38       ` Wang, Wei W
  0 siblings, 0 replies; 146+ messages in thread
From: Wang, Wei W @ 2017-10-02 16:38 UTC (permalink / raw)
  To: 'Michael S. Tsirkin'
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > +			  struct virtio_balloon_ctrlq_cmd *cmd,
> > +			  bool inbuf)
> > +{
> > +	struct virtqueue *vq = vb->ctrl_vq;
> > +
> > +	ctrlq_add_cmd(vq, cmd, inbuf);
> > +	if (!inbuf) {
> > +		/*
> > +		 * All the input cmd buffers are replenished here.
> > +		 * This is necessary because the input cmd buffers are lost
> > +		 * after live migration. The device needs to rewind all of
> > +		 * them from the ctrl_vq.
> 
> Confused. Live migration somehow loses state? Why is that and why is it a good
> idea? And how do you know this is migration even?
> Looks like all you know is you got free page end. Could be any reason for this.


I think this would be something that the current live migration lacks - what the
device read from the vq is not transferred during live migration, an example is the 
stat_vq_elem: 
Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c

For all the things that are added to the vq and need to be held by the device
to use later need to consider the situation that live migration might happen at any
time and they need to be re-taken from the vq by the device on the destination
machine.

So, even without this live migration optimization feature, I think all the things that are 
added to the vq for the device to hold, need a way for the device to rewind back from
the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
on the vq so that the device side rewinding can work. 

Please let me know if anything is missed or if you have other suggestions.


> > +static void ctrlq_handle(struct virtqueue *vq) {
> > +	struct virtio_balloon *vb = vq->vdev->priv;
> > +	struct virtio_balloon_ctrlq_cmd *msg;
> > +	unsigned int class, cmd, len;
> > +
> > +	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> > +	if (unlikely(!msg))
> > +		return;
> > +
> > +	/* The outbuf is sent by the host for recycling, so just return. */
> > +	if (msg == &vb->free_page_cmd_out)
> > +		return;
> > +
> > +	class = virtio32_to_cpu(vb->vdev, msg->class);
> > +	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> > +
> > +	switch (class) {
> > +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> > +		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> > +			vb->report_free_page_stop = true;
> > +		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> > +			vb->report_free_page_stop = false;
> > +			queue_work(vb->balloon_wq, &vb-
> >report_free_page_work);
> > +		}
> > +		vb->free_page_cmd_in.class =
> > +
> 	VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> > +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> > +	break;
> > +	default:
> > +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> > +			 __func__);
> > +	}
> 
> Manipulating report_free_page_stop without any locks looks very suspicious.

> Also, what if we get two start commands? we should restart from beginning,
> should we not?
> 


Yes, it will start reporting free pages from the beginning.
walk_free_mem_block() doesn't maintain any internal state, so every invocation of
it starts from the beginning.
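For example (an illustrative caller, not code from the patch; the callback and
context names are made up):

struct walk_ctx {
        unsigned long pages_seen;
        unsigned long budget;
};

static bool count_pfn_range(void *opaque, unsigned long pfn, unsigned long num)
{
        struct walk_ctx *ctx = opaque;

        ctx->pages_seen += num;
        /* Returning false asks walk_free_mem_block() to stop early. */
        return ctx->pages_seen < ctx->budget;
}

static void report_twice(void)
{
        struct walk_ctx ctx = { .pages_seen = 0, .budget = 1UL << 20 };

        walk_free_mem_block(&ctx, 0, count_pfn_range);
        /*
         * The second call starts over from the first populated zone,
         * because no walking state is kept between invocations.
         */
        walk_free_mem_block(&ctx, 0, count_pfn_range);
}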


> > +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE
> */
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> > +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> > +
> >  #endif /* _LINUX_VIRTIO_BALLOON_H */
> 
> The stop command does not appear to be thought through.
> 
> Let's assume e.g. you started migration. You ask guest for free pages.
> Then you cancel it.  There are a bunch of pages in free vq and you are getting
> more.  You now want to start migration again. What to do?
> 
> A bunch of vq flushing and waiting will maybe do the trick, but waiting on guest
> is never a great idea.
> 


I think the device can flush the vq (pop out what's left in the vq and push the
entries back) right after the Stop command is sent to the guest, rather than doing
the flush when the 2nd live migration is initiated. The entries pushed back to the
vq will be in the used ring, so what would the device need to wait for?
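For reference, a rough sketch of the device-side flush I have in mind, written on
top of the generic QEMU virtqueue helpers (the function name is made up and error
handling is omitted, so please treat it as a sketch only):

static void free_page_vq_flush(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtQueueElement *elem;

    /*
     * Drain whatever stale free page hints are still queued and hand the
     * entries straight back to the guest via the used ring.
     */
    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
        virtqueue_push(vq, elem, 0);
        g_free(elem);
    }
    virtio_notify(vdev, vq);
}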


> I previously suggested pushing the stop/start commands from guest to host on
> the free page vq, and including an ID in host to guest and guest to host
> commands. This way ctrl vq is just for host to guest commands, and host
> matches commands and knows which command is a free page in response to.
> 
> I still think it's a good idea but go ahead and propose something else that works.
> 

Thanks for the suggestion. Probably I haven't fully understood it. Please see the example
below:

1) host-to-guest ctrl_vq:
StartCMD, ID=1

2) guest-to-host free_page_vq:
free_page, ID=1
free_page, ID=1
free_page, ID=1
free_page, ID=1

3) host-to-guest ctrl_vq:
StopCMD, ID=1

4) initiate the 2nd try of live migration via host-to-guest ctrl_vq:
StartCMD, ID=2

5) the guest-to-host free_page_vq might look like this:
free_page, ID=1
free_page, ID=1
free_page, ID=2
free_page, ID=2

The device will need to drop the first two obsolete free-page hints, which were
sent with ID=1 (pop out the two entries and push them back).

I haven't seen the benefit of this yet. The device performs the same operations to
get rid of the old free pages either way. If we drop the old free pages right after
the StopCMD (an ID may also not be needed in this case), the overhead isn't added to
the live migration time.
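For completeness, the device-side filtering with IDs would look roughly like this
(all the names below are made up for illustration; this is not meant as actual QEMU
code):

struct free_page_hint {
    uint32_t id;      /* command ID the guest copied from the StartCMD */
    uint64_t pfn;
    uint64_t num;
};

static void handle_free_page_hint(const struct free_page_hint *hint,
                                  uint32_t current_cmd_id)
{
    if (hint->id != current_cmd_id) {
        /* Stale hint from an earlier StartCMD: just recycle the entry. */
        recycle_vq_entry(hint);
        return;
    }
    mark_range_as_free(hint->pfn, hint->num);
}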

Do you have any thoughts about this?


Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 4/5] mm: support reporting free page blocks
  2017-09-30  4:05   ` Wei Wang
  (?)
@ 2017-10-03 14:50     ` Michal Hocko
  -1 siblings, 0 replies; 146+ messages in thread
From: Michal Hocko @ 2017-10-03 14:50 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Sat 30-09-17 12:05:53, Wei Wang wrote:
> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> One use example of this patch is to accelerate live migration by skipping
> the transfer of free pages reported from the guest. A popular method used
> by the hypervisor to track which part of memory is written during live
> migration is to write-protect all the guest memory. So, those pages that
> are reported as free pages but are written after the report function
> returns will be captured by the hypervisor, and they will be added to the
> next round of memory transfer.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm.h |  6 ++++
>  mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 97 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..d9652c2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +extern void walk_free_mem_block(void *opaque,
> +				int min_order,
> +				bool (*report_pfn_range)(void *opaque,
> +							 unsigned long pfn,
> +							 unsigned long num));
> +
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d00f74..c6bb874 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4762,6 +4762,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/*
> + * Walk through a free page list and report the found pfn range via the
> + * callback.
> + *
> + * Return false if the callback requests to stop reporting. Otherwise,
> + * return true.
> + */
> +static bool walk_free_page_list(void *opaque,
> +				struct zone *zone,
> +				int order,
> +				enum migratetype mt,
> +				bool (*report_pfn_range)(void *,
> +							 unsigned long,
> +							 unsigned long))
> +{
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned long pfn, flags;
> +	bool ret;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	list = &zone->free_area[order].free_list[mt];
> +	list_for_each_entry(page, list, lru) {
> +		pfn = page_to_pfn(page);
> +		ret = report_pfn_range(opaque, pfn, 1 << order);
> +		if (!ret)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @report_pfn_range: the callback to report the pfn range of the free pages
> + *
> + * If the callback returns false, stop iterating the list of free page blocks.
> + * Otherwise, continue to report.
> + *
> + * Please note that there are no locking guarantees for the callback and
> + * that the reported pfn range might be freed or disappear after the
> + * callback returns so the caller has to be very careful how it is used.
> + *
> + * The callback itself must not sleep or perform any operations which would
> + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> + * or via any lock dependency. It is generally advisable to implement
> + * the callback as simple as possible and defer any heavy lifting to a
> + * different context.
> + *
> + * There is no guarantee that each free range will be reported only once
> + * during one walk_free_mem_block invocation.
> + *
> + * pfn_to_page on the given range is strongly discouraged and if there is
> + * an absolute need for that make sure to contact MM people to discuss
> + * potential problems.
> + *
> + * The function itself might sleep so it cannot be called from atomic
> + * contexts.
> + *
> + * In general low orders tend to be very volatile and so it makes more
> + * sense to query larger ones first for various optimizations which like
> + * ballooning etc... This will reduce the overhead as well.
> + */
> +void walk_free_mem_block(void *opaque,
> +			 int min_order,
> +			 bool (*report_pfn_range)(void *opaque,
> +						  unsigned long pfn,
> +						  unsigned long num))
> +{
> +	struct zone *zone;
> +	int order;
> +	enum migratetype mt;
> +	bool ret;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				ret = walk_free_page_list(opaque, zone,
> +							  order, mt,
> +							  report_pfn_range);
> +				if (!ret)
> +					return;
> +			}
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 0/5] Virtio-balloon Enhancement
  2017-10-01 13:25   ` Damian Tometzki
  (?)
  (?)
@ 2017-10-09  9:39     ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-09  9:39 UTC (permalink / raw)
  To: Damian Tometzki, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, mhocko, akpm, mawilcox
  Cc: david, cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	willy, liliang.opensource, yang.zhang.wz, quan.xu

On 10/01/2017 09:25 PM, Damian Tometzki wrote:
> Hello,
>
> where i can found the patch in git.kernel.org ?
>

We don't have patches there. If you want to try this feature, you can
get the QEMU-side draft code here: https://github.com/wei-w-wang/qemu-lm

Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap
  2017-09-30  4:05   ` Wei Wang
  (?)
@ 2017-10-09 11:30     ` Tetsuo Handa
  -1 siblings, 0 replies; 146+ messages in thread
From: Tetsuo Handa @ 2017-10-09 11:30 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, mhocko, akpm, mawilcox, david, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, willy,
	liliang.opensource, yang.zhang.wz, quan.xu

On 2017/09/30 13:05, Wei Wang wrote:
>  /**
> + *  xb_preload - preload for xb_set_bit()
> + *  @gfp_mask: allocation mask to use for preloading
> + *
> + * Preallocate memory to use for the next call to xb_set_bit(). This function
> + * returns with preemption disabled. It will be enabled by xb_preload_end().
> + */
> +void xb_preload(gfp_t gfp)
> +{
> +	if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
> +		preempt_disable();
> +
> +	if (!this_cpu_read(ida_bitmap)) {
> +		struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
> +
> +		if (!bitmap)
> +			return;
> +		bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
> +		kfree(bitmap);
> +	}
> +}

I'm not sure whether this function is safe.

__radix_tree_preload() returns 0 with preemption disabled upon success.
xb_preload() disables preemption if __radix_tree_preload() fails.
Then, kmalloc() is called with preemption disabled, isn't it?
But xb_set_page() calls xb_preload(GFP_KERNEL) which might sleep...

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-09-30  4:05   ` Wei Wang
  (?)
  (?)
@ 2017-10-09 15:20     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-09 15:20 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +			       struct page *page,
> +			       unsigned long *pfn_min,
> +			       unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +	xb_preload(GFP_KERNEL);
> +	xb_set_bit(&vb->page_xb, pfn);
> +	xb_preload_end();
> +}
> +

So, this will allocate memory

...

> @@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!use_sg)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */

And is sometimes called on OOM.


I suspect we need to

1. keep some memory around for the leak-on-OOM path

2. for the non-OOM path, allocate outside the locks


-- 
MST

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-09 15:20     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-10-10  7:28       ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-10  7:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu, Tetsuo Handa

On 10/09/2017 11:20 PM, Michael S. Tsirkin wrote:
> On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
>> +static inline void xb_set_page(struct virtio_balloon *vb,
>> +			       struct page *page,
>> +			       unsigned long *pfn_min,
>> +			       unsigned long *pfn_max)
>> +{
>> +	unsigned long pfn = page_to_pfn(page);
>> +
>> +	*pfn_min = min(pfn, *pfn_min);
>> +	*pfn_max = max(pfn, *pfn_max);
>> +	xb_preload(GFP_KERNEL);
>> +	xb_set_bit(&vb->page_xb, pfn);
>> +	xb_preload_end();
>> +}
>> +
> So, this will allocate memory
>
> ...
>
>> @@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>>   	struct page *page;
>>   	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>>   	LIST_HEAD(pages);
>> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>>   
>> -	/* We can only do one array worth at a time. */
>> -	num = min(num, ARRAY_SIZE(vb->pfns));
>> +	/* Traditionally, we can only do one array worth at a time. */
>> +	if (!use_sg)
>> +		num = min(num, ARRAY_SIZE(vb->pfns));
>>   
>>   	mutex_lock(&vb->balloon_lock);
>>   	/* We can't release more pages than taken */
> And is sometimes called on OOM.
>
>
> I suspect we need to
>
> 1. keep around some memory for leak on oom
>
> 2. for non oom allocate outside locks
>
>

I think maybe we can optimize the existing balloon logic in a way that
removes the big balloon lock:

It is not necessary to have inflating and deflating run at the same
time. For example, suppose the 1st request is to inflate 7G of RAM and,
when 1G has been given to the host (so 6G is left), a 2nd request to
deflate 5G is received. Instead of waiting for the 1st request to
inflate the remaining 6G and then continuing with the 2nd request to
deflate 5G, we can take the difference immediately (6G to inflate - 5G
to deflate = 1G to inflate). In this way, all the driver needs to do is
simply inflate another 1G.

The same applies to the OOM case: when OOM asks for 1G while inflating
5G is in progress, the driver can deduct 1G from the amount that still
needs to be inflated and, as a result, it will inflate only 4G.

In this case, the inflating and deflating tasks never run at the same
time, so I think it is possible to remove the lock, and therefore we
will not have that deadlock issue.
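
To make the idea concrete, here is a rough sketch (the delta field,
balloon_work and balloon_request() are made-up names, just for
illustration):

static void balloon_request(struct virtio_balloon *vb, s64 pages)
{
	/* pages > 0: inflate request; pages < 0: deflate or OOM request */
	atomic64_add(pages, &vb->delta);
	queue_work(vb->balloon_wq, &vb->balloon_work);
}

static void balloon_work_fn(struct work_struct *work)
{
	struct virtio_balloon *vb =
		container_of(work, struct virtio_balloon, balloon_work);
	s64 delta = atomic64_xchg(&vb->delta, 0);

	if (delta > 0)
		fill_balloon(vb, delta);
	else if (delta < 0)
		leak_balloon(vb, -delta);
}

Since only this one work item ever calls fill_balloon()/leak_balloon(),
the two never run concurrently; a real implementation would also
re-check the delta from inside fill/leak so that an opposite request
arriving mid-way (like the OOM case above) takes effect immediately.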

What would you guys think?

Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-10  7:28       ` Wei Wang
  (?)
@ 2017-10-10 11:08         ` Tetsuo Handa
  -1 siblings, 0 replies; 146+ messages in thread
From: Tetsuo Handa @ 2017-10-10 11:08 UTC (permalink / raw)
  To: wei.w.wang, mst
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

Wei Wang wrote:
> On 10/09/2017 11:20 PM, Michael S. Tsirkin wrote:
> > On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> >> +static inline void xb_set_page(struct virtio_balloon *vb,
> >> +			       struct page *page,
> >> +			       unsigned long *pfn_min,
> >> +			       unsigned long *pfn_max)
> >> +{
> >> +	unsigned long pfn = page_to_pfn(page);
> >> +
> >> +	*pfn_min = min(pfn, *pfn_min);
> >> +	*pfn_max = max(pfn, *pfn_max);
> >> +	xb_preload(GFP_KERNEL);
> >> +	xb_set_bit(&vb->page_xb, pfn);
> >> +	xb_preload_end();
> >> +}
> >> +
> > So, this will allocate memory
> >
> > ...
> >
> >> @@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
> >>   	struct page *page;
> >>   	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> >>   	LIST_HEAD(pages);
> >> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> >> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
> >>   
> >> -	/* We can only do one array worth at a time. */
> >> -	num = min(num, ARRAY_SIZE(vb->pfns));
> >> +	/* Traditionally, we can only do one array worth at a time. */
> >> +	if (!use_sg)
> >> +		num = min(num, ARRAY_SIZE(vb->pfns));
> >>   
> >>   	mutex_lock(&vb->balloon_lock);
> >>   	/* We can't release more pages than taken */
> > And is sometimes called on OOM.
> >
> >
> > I suspect we need to
> >
> > 1. keep around some memory for leak on oom
> >
> > 2. for non oom allocate outside locks
> >
> >
> 
> I think maybe we can optimize the existing balloon logic in a way that
> removes the big balloon lock:
> 
> It is not necessary to have inflating and deflating run at the same
> time. For example, suppose the 1st request is to inflate 7G of RAM and,
> when 1G has been given to the host (so 6G is left), a 2nd request to
> deflate 5G is received. Instead of waiting for the 1st request to
> inflate the remaining 6G and then continuing with the 2nd request to
> deflate 5G, we can take the difference immediately (6G to inflate - 5G
> to deflate = 1G to inflate). In this way, all the driver needs to do is
> simply inflate another 1G.
> 
> The same applies to the OOM case: when OOM asks for 1G while inflating
> 5G is in progress, the driver can deduct 1G from the amount that still
> needs to be inflated and, as a result, it will inflate only 4G.
> 
> In this case, the inflating and deflating tasks never run at the same
> time, so I think it is possible to remove the lock, and therefore we
> will not have that deadlock issue.
> 
> What would you guys think?

What is balloon_lock at virtballoon_migratepage() for?

  e22504296d4f64fb "virtio_balloon: introduce migration primitives to balloon pages"
  f68b992bbb474641 "virtio_balloon: fix race by fill and leak"

And even if we could remove balloon_lock, you still cannot use
__GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
"whether it is safe to wait" flag from
"[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-10 11:08         ` Tetsuo Handa
  (?)
  (?)
@ 2017-10-10 12:32           ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-10 12:32 UTC (permalink / raw)
  To: Tetsuo Handa, mst
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 10/10/2017 07:08 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> On 10/09/2017 11:20 PM, Michael S. Tsirkin wrote:
>>> On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
>>>> +static inline void xb_set_page(struct virtio_balloon *vb,
>>>> +			       struct page *page,
>>>> +			       unsigned long *pfn_min,
>>>> +			       unsigned long *pfn_max)
>>>> +{
>>>> +	unsigned long pfn = page_to_pfn(page);
>>>> +
>>>> +	*pfn_min = min(pfn, *pfn_min);
>>>> +	*pfn_max = max(pfn, *pfn_max);
>>>> +	xb_preload(GFP_KERNEL);
>>>> +	xb_set_bit(&vb->page_xb, pfn);
>>>> +	xb_preload_end();
>>>> +}
>>>> +
>>> So, this will allocate memory
>>>
>>> ...
>>>
>>>> @@ -198,9 +327,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>>>>    	struct page *page;
>>>>    	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>>>>    	LIST_HEAD(pages);
>>>> +	bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
>>>> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>>>>    
>>>> -	/* We can only do one array worth at a time. */
>>>> -	num = min(num, ARRAY_SIZE(vb->pfns));
>>>> +	/* Traditionally, we can only do one array worth at a time. */
>>>> +	if (!use_sg)
>>>> +		num = min(num, ARRAY_SIZE(vb->pfns));
>>>>    
>>>>    	mutex_lock(&vb->balloon_lock);
>>>>    	/* We can't release more pages than taken */
>>> And is sometimes called on OOM.
>>>
>>>
>>> I suspect we need to
>>>
>>> 1. keep around some memory for leak on oom
>>>
>>> 2. for non oom allocate outside locks
>>>
>>>
>> I think maybe we can optimize the existing balloon logic in a way that
>> removes the big balloon lock:
>>
>> It is not necessary to have inflating and deflating run at the same
>> time. For example, suppose the 1st request is to inflate 7G of RAM and,
>> when 1G has been given to the host (so 6G is left), a 2nd request to
>> deflate 5G is received. Instead of waiting for the 1st request to
>> inflate the remaining 6G and then continuing with the 2nd request to
>> deflate 5G, we can take the difference immediately (6G to inflate - 5G
>> to deflate = 1G to inflate). In this way, all the driver needs to do is
>> simply inflate another 1G.
>>
>> The same applies to the OOM case: when OOM asks for 1G while inflating
>> 5G is in progress, the driver can deduct 1G from the amount that still
>> needs to be inflated and, as a result, it will inflate only 4G.
>>
>> In this case, the inflating and deflating tasks never run at the same
>> time, so I think it is possible to remove the lock, and therefore we
>> will not have that deadlock issue.
>>
>> What would you guys think?
> What is balloon_lock at virtballoon_migratepage() for?
>
>    e22504296d4f64fb "virtio_balloon: introduce migration primitives to balloon pages"
>    f68b992bbb474641 "virtio_balloon: fix race by fill and leak"

I think that's the part of the existing implementation we would need to
improve when going in the above direction.

As also stated in the commit log, the lock was introduced to synchronize
accesses to the elements of struct virtio_balloon and its queue
operations. To be more precise, fill_balloon/leak_balloon/migratepage
share vb->pfns[] and vb->num_pfns, which could instead be changed to use
local variables of their own.

For example, for migratepage:
+       __virtio32 pfn;
...
-       vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-       set_page_pfns(vb, vb->pfns, newpage);
-       tell_host(vb, vb->inflate_vq);
+       set_page_pfns(vb, &pfn, newpage);
+       tell_host(vb, vb->inflate_vq, &pfn, VIRTIO_BALLOON_PAGES_PER_PAGE);

For the queue accesses, a small lock taken just around each queue
operation would be enough, which I think won't cause the deadlock issue.
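
For example, a sketch of the new tell_host() with such a lock (vq_lock
here is a hypothetical per-device mutex, and error handling is omitted):

static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq,
		      __virtio32 *pfns, unsigned int num)
{
	struct scatterlist sg;
	unsigned int len;

	sg_init_one(&sg, pfns, sizeof(pfns[0]) * num);

	mutex_lock(&vb->vq_lock);	/* protects only the vq operations */
	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
	virtqueue_kick(vq);
	/* wait until the host has consumed this entry */
	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
	mutex_unlock(&vb->vq_lock);
}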



> And even if we could remove balloon_lock, you still cannot use
> __GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
> "whether it is safe to wait" flag from
> "[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .

Without the lock being held, why couldn't we use __GFP_DIRECT_RECLAIM at 
xb_set_page()?


Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-10 12:32           ` Wei Wang
  (?)
@ 2017-10-10 13:09             ` Tetsuo Handa
  -1 siblings, 0 replies; 146+ messages in thread
From: Tetsuo Handa @ 2017-10-10 13:09 UTC (permalink / raw)
  To: wei.w.wang, mst
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

Wei Wang wrote:
> > And even if we could remove balloon_lock, you still cannot use
> > __GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
> > "whether it is safe to wait" flag from
> > "[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .
> 
> Without the lock being held, why couldn't we use __GFP_DIRECT_RECLAIM at 
> xb_set_page()?

Because of the dependency shown below.

leak_balloon()
  xb_set_page()
    xb_preload(GFP_KERNEL)
      kmalloc(GFP_KERNEL)
        __alloc_pages_may_oom()
          Takes oom_lock
          out_of_memory()
            blocking_notifier_call_chain()
              leak_balloon()
                xb_set_page()
                  xb_preload(GFP_KERNEL)
                    kmalloc(GFP_KERNEL)
                      __alloc_pages_may_oom()
                        Fails to take oom_lock and loop forever

By the way, is xb_set_page() safe?
Sleeping in the kernel with preemption disabled is a bug, isn't it?
__radix_tree_preload() returns 0 with preemption disabled upon success.
xb_preload() disables preemption if __radix_tree_preload() fails.
Then, kmalloc() is called with preemption disabled, isn't it?
But xb_set_page() calls xb_preload(GFP_KERNEL) which might sleep with
preemption disabled.
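
For reference, the radix tree convention that this mirrors is roughly
(a generic example, not the xbitmap code itself):

int add_entry(struct radix_tree_root *root, unsigned long index, void *item)
{
	int err;

	err = radix_tree_preload(GFP_KERNEL);	/* may sleep and allocate */
	if (err)
		return err;
	/* on success we now run with preemption disabled ... */
	err = radix_tree_insert(root, index, item);	/* consumes the preload */
	radix_tree_preload_end();	/* ... until it is re-enabled here */
	return err;
}

So anything that can sleep (such as another GFP_KERNEL allocation) must
not happen between a successful preload and preload_end().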

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-10-02 16:38       ` Wang, Wei W
                           ` (2 preceding siblings ...)
  (?)
@ 2017-10-10 15:15         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-10 15:15 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Mon, Oct 02, 2017 at 04:38:01PM +0000, Wang, Wei W wrote:
> On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> > On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > > +			  struct virtio_balloon_ctrlq_cmd *cmd,
> > > +			  bool inbuf)
> > > +{
> > > +	struct virtqueue *vq = vb->ctrl_vq;
> > > +
> > > +	ctrlq_add_cmd(vq, cmd, inbuf);
> > > +	if (!inbuf) {
> > > +		/*
> > > +		 * All the input cmd buffers are replenished here.
> > > +		 * This is necessary because the input cmd buffers are lost
> > > +		 * after live migration. The device needs to rewind all of
> > > +		 * them from the ctrl_vq.
> > 
> > Confused. Live migration somehow loses state? Why is that and why is it a good
> > idea? And how do you know this is migration even?
> > Looks like all you know is you got free page end. Could be any reason for this.
> 
> 
> I think this would be something that the current live migration lacks - what the
> device read from the vq is not transferred during live migration, an example is the 
> stat_vq_elem: 
> Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c

This does not touch guest memory, though; it just manipulates
internal device state to make it easier to migrate.
It's transparent to the guest, as migration should be.

> For all the things that are added to the vq and need to be held by the device
> for later use, we need to consider the situation that live migration might happen
> at any time and they need to be re-taken from the vq by the device on the
> destination machine.
> 
> So, even without this live migration optimization feature, I think all the things that are 
> added to the vq for the device to hold, need a way for the device to rewind back from
> the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
> on the vq so that the device side rewinding can work. 
> 
> Please let me know if anything is missed or if you have other suggestions.

IMO migration should pass enough data from source to destination for
the destination to continue where the source left off without guest help.

> 
> > > +static void ctrlq_handle(struct virtqueue *vq) {
> > > +	struct virtio_balloon *vb = vq->vdev->priv;
> > > +	struct virtio_balloon_ctrlq_cmd *msg;
> > > +	unsigned int class, cmd, len;
> > > +
> > > +	msg = (struct virtio_balloon_ctrlq_cmd *)virtqueue_get_buf(vq, &len);
> > > +	if (unlikely(!msg))
> > > +		return;
> > > +
> > > +	/* The outbuf is sent by the host for recycling, so just return. */
> > > +	if (msg == &vb->free_page_cmd_out)
> > > +		return;
> > > +
> > > +	class = virtio32_to_cpu(vb->vdev, msg->class);
> > > +	cmd =  virtio32_to_cpu(vb->vdev, msg->cmd);
> > > +
> > > +	switch (class) {
> > > +	case VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE:
> > > +		if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_STOP) {
> > > +			vb->report_free_page_stop = true;
> > > +		} else if (cmd == VIRTIO_BALLOON_FREE_PAGE_F_START) {
> > > +			vb->report_free_page_stop = false;
> > > +			queue_work(vb->balloon_wq, &vb->report_free_page_work);
> > > +		}
> > > +		vb->free_page_cmd_in.class =
> > > +				VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE;
> > > +		ctrlq_send_cmd(vb, &vb->free_page_cmd_in, true);
> > > +	break;
> > > +	default:
> > > +		dev_warn(&vb->vdev->dev, "%s: cmd class not supported\n",
> > > +			 __func__);
> > > +	}
> > 
> > Manipulating report_free_page_stop without any locks looks very suspicious.
> 
> > Also, what if we get two start commands? we should restart from beginning,
> > should we not?
> > 
> 
> 
> Yes, it will start to report free pages from the beginning.
> walk_free_mem_block() doesn't maintain any internal state, so invoking it
> will always start from the beginning.

Well yes but it will first complete the previous walk.

> 
> > > +/* Ctrlq commands related to VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE
> > */
> > > +#define VIRTIO_BALLOON_FREE_PAGE_F_STOP		0
> > > +#define VIRTIO_BALLOON_FREE_PAGE_F_START	1
> > > +
> > >  #endif /* _LINUX_VIRTIO_BALLOON_H */
> > 
> > The stop command does not appear to be thought through.
> > 
> > Let's assume e.g. you started migration. You ask guest for free pages.
> > Then you cancel it.  There are a bunch of pages in free vq and you are getting
> > more.  You now want to start migration again. What to do?
> > 
> > A bunch of vq flushing and waiting will maybe do the trick, but waiting on guest
> > is never a great idea.
> > 
> 
> 
> I think the device can flush the vq (pop out what's left in it and push it back)
> right after the Stop command is sent to the guest, rather than doing the flush
> when the 2nd initiation of live migration begins. The entries pushed back to the vq
> will be in the used ring, so what would the device need to wait for?

You will be getting stale pages in the available ring which were possibly
taken off the free list, since memory is not tracked when migration is not
going on.



> > I previously suggested pushing the stop/start commands from guest to host on
> > the free page vq, and including an ID in host to guest and guest to host
> > commands. This way ctrl vq is just for host to guest commands, and host
> > matches commands and knows which command a free page is in response to.
> > 
> > I still think it's a good idea but go ahead and propose something else that works.
> > 
> 
> Thanks for the suggestion. Probably I haven't fully understood it. Please see the example
> below:
> 
> 1) host-to-guest ctrl_vq:
> StartCMD, ID=1
> 
> 2) guest-to-host free_page_vq:
> free_page, ID=1
> free_page, ID=1
> free_page, ID=1
> free_page, ID=1
> 
> 3) host-to-guest ctrl_vq:
> StopCMD, ID=1
> 
> 4) initiate the 2nd try of live migration via host-to-guest ctrl_vq:
> StartCMD, ID=2
> 
> 5) the guest-to-host free_page_vq might look like this:
> free_page, ID=1
> free_page, ID=1
> free_page, ID=2
> free_page, ID=2
> 
> The device will need to drop (pop out the two entries and push them back)
> the first 2 obsolete free pages, which were sent with ID=1.

Yes. But you do not have to attach an ID to each page.

It can be:

ID=1
free_page
free_page
ID=2
free_page
free_page



> I haven't seen the benefit of the above yet. The device will perform the same operations
> to get rid of the old free pages. If we drop the old free pages right after the StopCMD
> (the ID may also not be needed in this case), the overhead won't be added to the live
> migration time.
> Would you have any thought about this?
> 
> 
> Best,
> Wei
> 

As these are separate vqs, there is no clean way to know whether a
free_page was queued before or after the stop command.
Sending the ID helps detect where the free pages for a given start
command are.
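
A rough sketch of what the device side could then do with such IDs
(hypothetical code; the names are made up and this is not from the QEMU
patch):

#include <stdint.h>
#include <stdbool.h>

/* One element popped from the free_page_vq: either an ID marker the
 * guest pushes after receiving a start command, or a run of free pfns. */
struct fpvq_elem {
	bool is_id;
	uint32_t id;		/* valid when is_id */
	uint64_t pfn, nr;	/* valid when !is_id */
};

static uint32_t latest_start_id;	/* ID sent with the most recent start command */
static uint32_t stream_id;		/* last ID marker seen on the vq */

/* Hypothetical hook into the migration bitmap bookkeeping. */
static void skip_pages_in_first_round(uint64_t pfn, uint64_t nr)
{
	(void)pfn;
	(void)nr;
}

static void handle_fpvq_elem(const struct fpvq_elem *e)
{
	if (e->is_id) {
		stream_id = e->id;
		return;
	}
	/* Hints queued before the latest start command belong to an old,
	 * possibly cancelled run; drop them without waiting on the guest. */
	if (stream_id != latest_start_id)
		return;
	skip_pages_in_first_round(e->pfn, e->nr);
}

The point is only that the comparison happens on the device side, so a
cancelled run never forces the device to wait for the guest to drain the vq.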

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-10 13:09             ` Tetsuo Handa
  (?)
  (?)
@ 2017-10-11  1:51               ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-11  1:51 UTC (permalink / raw)
  To: Tetsuo Handa, mst
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 10/10/2017 09:09 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>>> And even if we could remove balloon_lock, you still cannot use
>>> __GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
>>> "whether it is safe to wait" flag from
>>> "[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .
>> Without the lock being held, why couldn't we use __GFP_DIRECT_RECLAIM at
>> xb_set_page()?
> Because of dependency shown below.
>
> leak_balloon()
>    xb_set_page()
>      xb_preload(GFP_KERNEL)
>        kmalloc(GFP_KERNEL)
>          __alloc_pages_may_oom()
>            Takes oom_lock
>            out_of_memory()
>              blocking_notifier_call_chain()
>                leak_balloon()
>                  xb_set_page()
>                    xb_preload(GFP_KERNEL)
>                      kmalloc(GFP_KERNEL)
>                        __alloc_pages_may_oom()
>                          Fails to take oom_lock and loop forever

__alloc_pages_may_oom() uses mutex_trylock(&oom_lock).

I think the second __alloc_pages_may_oom() will not continue since the
first one is in progress.

>
> By the way, is xb_set_page() safe?
> Sleeping in the kernel with preemption disabled is a bug, isn't it?
> __radix_tree_preload() returns 0 with preemption disabled upon success.
> xb_preload() disables preemption if __radix_tree_preload() fails.
> Then, kmalloc() is called with preemption disabled, isn't it?
> But xb_set_page() calls xb_preload(GFP_KERNEL) which might sleep with
> preemption disabled.

Yes, I think that should not be expected, thanks.

I plan to change it like this:

bool xb_preload(gfp_t gfp)
{
         if (!this_cpu_read(ida_bitmap)) {
                 struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);

                 if (!bitmap)
                         return false;
                 bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
                 kfree(bitmap);
         }

         if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
                 return false;

         return true;
}


Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-11  1:51               ` Wei Wang
@ 2017-10-11  2:26                 ` Tetsuo Handa
  -1 siblings, 0 replies; 146+ messages in thread
From: Tetsuo Handa @ 2017-10-11  2:26 UTC (permalink / raw)
  To: Wei Wang
  Cc: mst, virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

Wei Wang wrote:
>On 10/10/2017 09:09 PM, Tetsuo Handa wrote:
>> Wei Wang wrote:
>>>> And even if we could remove balloon_lock, you still cannot use
>>>> __GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
>>>> "whether it is safe to wait" flag from
>>>> "[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .
>>> Without the lock being held, why couldn't we use __GFP_DIRECT_RECLAIM at
>>> xb_set_page()?
>> Because of dependency shown below.
>>
>> leak_balloon()
>>    xb_set_page()
>>      xb_preload(GFP_KERNEL)
>>        kmalloc(GFP_KERNEL)
>>          __alloc_pages_may_oom()
>>            Takes oom_lock
>>            out_of_memory()
>>              blocking_notifier_call_chain()
>>                leak_balloon()
>>                  xb_set_page()
>>                    xb_preload(GFP_KERNEL)
>>                      kmalloc(GFP_KERNEL)
>>                        __alloc_pages_may_oom()
>>                          Fails to take oom_lock and loop forever
>
>__alloc_pages_may_oom() uses mutex_trylock(&oom_lock).

Yes. But this mutex_trylock(&oom_lock) is semantically mutex_lock(&oom_lock)
because __alloc_pages_slowpath() will continue looping until
mutex_trylock(&oom_lock) succeeds (or somebody releases memory).

>
>I think the second __alloc_pages_may_oom() will not continue since the
>first one is in progress.

The second __alloc_pages_may_oom() will be called repeatedly because
__alloc_pages_slowpath() will continue looping (unless somebody releases
memory).
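
(For illustration, a grossly simplified sketch of that retry structure - not
the real allocator code; try_reclaim_and_alloc() is only a placeholder:)

static struct page *alloc_retry_sketch(struct oom_control *oc)
{
        struct page *page;

        for (;;) {
                /* placeholder for the reclaim/compaction attempts */
                page = try_reclaim_and_alloc();
                if (page)
                        return page;
                /* __alloc_pages_may_oom(): only one OOM killer at a time */
                if (!mutex_trylock(&oom_lock))
                        continue;       /* lost the race: loop and retry */
                out_of_memory(oc);      /* may invoke the balloon OOM notifier */
                mutex_unlock(&oom_lock);
        }
}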

>
>>
>> By the way, is xb_set_page() safe?
>> Sleeping in the kernel with preemption disabled is a bug, isn't it?
>> __radix_tree_preload() returns 0 with preemption disabled upon success.
>> xb_preload() disables preemption if __radix_tree_preload() fails.
>> Then, kmalloc() is called with preemption disabled, isn't it?
>> But xb_set_page() calls xb_preload(GFP_KERNEL) which might sleep with
>> preemption disabled.
>
>Yes, I think that should not be expected, thanks.
>
>I plan to change it like this:
>
>bool xb_preload(gfp_t gfp)
>{
>         if (!this_cpu_read(ida_bitmap)) {
>                 struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
>
>                 if (!bitmap)
>                         return false;
>                 bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
>                 kfree(bitmap);
>         }

Excuse me, but you are allocating per-CPU memory when running CPU might
change at this line? What happens if running CPU has changed at this line?
Will it work even with new CPU's ida_bitmap == NULL ?

>
>         if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
>                 return false;
>
>         return true;
>}


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG
  2017-10-11  2:26                 ` [Qemu-devel] " Tetsuo Handa
  (?)
  (?)
@ 2017-10-11  3:16                   ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-11  3:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mst, virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 10/11/2017 10:26 AM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> On 10/10/2017 09:09 PM, Tetsuo Handa wrote:
>>> Wei Wang wrote:
>>>>> And even if we could remove balloon_lock, you still cannot use
>>>>> __GFP_DIRECT_RECLAIM at xb_set_page(). I think you will need to use
>>>>> "whether it is safe to wait" flag from
>>>>> "[PATCH] virtio: avoid possible OOM lockup at virtballoon_oom_notify()" .
>>>> Without the lock being held, why couldn't we use __GFP_DIRECT_RECLAIM at
>>>> xb_set_page()?
>>> Because of dependency shown below.
>>>
>>> leak_balloon()
>>>    xb_set_page()
>>>      xb_preload(GFP_KERNEL)
>>>        kmalloc(GFP_KERNEL)
>>>          __alloc_pages_may_oom()
>>>            Takes oom_lock
>>>            out_of_memory()
>>>              blocking_notifier_call_chain()
>>>                leak_balloon()
>>>                  xb_set_page()
>>>                    xb_preload(GFP_KERNEL)
>>>                      kmalloc(GFP_KERNEL)
>>>                        __alloc_pages_may_oom()
>>>                          Fails to take oom_lock and loop forever
>> __alloc_pages_may_oom() uses mutex_trylock(&oom_lock).
> Yes. But this mutex_trylock(&oom_lock) is semantically mutex_lock(&oom_lock)
> because __alloc_pages_slowpath() will continue looping until
> mutex_trylock(&oom_lock) succeeds (or somebody releases memory).
>
>> I think the second __alloc_pages_may_oom() will not continue since the
>> first one is in progress.
> The second __alloc_pages_may_oom() will be called repeatedly because
> __alloc_pages_slowpath() will continue looping (unless somebody releases
> memory).
>

OK, I see, thanks. So, the point is that the OOM code path should not
do memory allocation, and the old leak_balloon() (without the F_SG
feature) doesn't need xb_preload(). I think one solution would be to let
the OOM notifier use the old leak_balloon() code path, and we can add one
more parameter to leak_balloon() to control that:

leak_balloon(struct virtio_balloon *vb, size_t num, bool oom)
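
(Just to sketch the idea - leak_balloon_legacy() and leak_balloon_sg() below
are placeholder names, not functions from the posted patch:)

static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num,
                                 bool oom)
{
        /*
         * The OOM notifier must not allocate memory, so it takes the
         * legacy one-array-per-call path, which needs no
         * xb_preload()/kmalloc(). The normal path may sleep and can use
         * the sg-based reporting.
         */
        if (oom)
                return leak_balloon_legacy(vb, num);

        return leak_balloon_sg(vb, num);
}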



>>> By the way, is xb_set_page() safe?
>>> Sleeping in the kernel with preemption disabled is a bug, isn't it?
>>> __radix_tree_preload() returns 0 with preemption disabled upon success.
>>> xb_preload() disables preemption if __radix_tree_preload() fails.
>>> Then, kmalloc() is called with preemption disabled, isn't it?
>>> But xb_set_page() calls xb_preload(GFP_KERNEL) which might sleep with
>>> preemption disabled.
>> Yes, I think that should not be expected, thanks.
>>
>> I plan to change it like this:
>>
>> bool xb_preload(gfp_t gfp)
>> {
>>         if (!this_cpu_read(ida_bitmap)) {
>>                 struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
>>
>>                 if (!bitmap)
>>                         return false;
>>                 bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
>>                 kfree(bitmap);
>>         }
> Excuse me, but you are allocating per-CPU memory when running CPU might
> change at this line? What happens if running CPU has changed at this line?
> Will it work even with new CPU's ida_bitmap == NULL ?
>


Yes, it will be detected in xb_set_bit(): when ida_bitmap == NULL on the
new CPU, xb_set_bit() will return -EAGAIN to the caller, and the caller
should restart from xb_preload().
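
(For illustration, the caller-side retry could look like the sketch below,
assuming a radix-tree-style xb_preload()/xb_preload_end() pair;
set_page_xb_bit() is a made-up wrapper name:)

static int set_page_xb_bit(struct xb *xb, unsigned long pfn)
{
        int ret;

        do {
                if (!xb_preload(GFP_KERNEL))
                        return -ENOMEM; /* preload allocation failed */
                ret = xb_set_bit(xb, pfn);
                xb_preload_end();
                /*
                 * -EAGAIN: we were migrated to a CPU whose ida_bitmap is
                 * still NULL, so preload again on the new CPU and retry.
                 */
        } while (ret == -EAGAIN);

        return ret;
}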

Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-10-10 15:15         ` Michael S. Tsirkin
                             ` (2 preceding siblings ...)
  (?)
@ 2017-10-11  6:03           ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-11  6:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 10/10/2017 11:15 PM, Michael S. Tsirkin wrote:
> On Mon, Oct 02, 2017 at 04:38:01PM +0000, Wang, Wei W wrote:
>> On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
>>> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
>>>> +static void ctrlq_send_cmd(struct virtio_balloon *vb,
>>>> +			  struct virtio_balloon_ctrlq_cmd *cmd,
>>>> +			  bool inbuf)
>>>> +{
>>>> +	struct virtqueue *vq = vb->ctrl_vq;
>>>> +
>>>> +	ctrlq_add_cmd(vq, cmd, inbuf);
>>>> +	if (!inbuf) {
>>>> +		/*
>>>> +		 * All the input cmd buffers are replenished here.
>>>> +		 * This is necessary because the input cmd buffers are lost
>>>> +		 * after live migration. The device needs to rewind all of
>>>> +		 * them from the ctrl_vq.
>>> Confused. Live migration somehow loses state? Why is that and why is it a good
>>> idea? And how do you know this is migration even?
>>> Looks like all you know is you got free page end. Could be any reason for this.
>>
>> I think this would be something that the current live migration lacks - what the
>> device read from the vq is not transferred during live migration, an example is the
>> stat_vq_elem:
>> Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c
> This does not touch guest memory though it just manipulates
> internal state to make it easier to migrate.
> It's transparent to guest as migration should be.
>
>> For all the things that are added to the vq and need to be held by the device
>> to use later need to consider the situation that live migration might happen at any
>> time and they need to be re-taken from the vq by the device on the destination
>> machine.
>>
>> So, even without this live migration optimization feature, I think all the things that are
>> added to the vq for the device to hold, need a way for the device to rewind back from
>> the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
>> on the vq so that the device side rewinding can work.
>>
>> Please let me know if anything is missed or if you have other suggestions.
> IMO migration should pass enough data source to destination for
> destination to continue where source left off without guest help.
>

I'm afraid it would be difficult to pass the entire VirtQueueElement to
the destination. I think that would also be the reason that stats_vq_elem
chose to rewind from the guest vq, which re-does the virtqueue_pop() -->
virtqueue_map_desc() steps (the QEMU virtual address to guest physical
address mapping may have changed on the destination).


How about another direction which would be easier - using two 32-bit
device-specific configuration registers, Host2Guest and Guest2Host
command registers, to replace the ctrlq for command exchange:

The flow can be as follows:

1) Before the host sends a StartCMD, it flushes the free_page_vq in case
any old free page hint is left there;

2) The host writes StartCMD to the Host2Guest register, and notifies the
guest;

3) Upon receiving a configuration notification, the guest reads the
Host2Guest register and detaches all the used buffers from the
free_page_vq (then for each StartCMD, the free_page_vq will always have
no obsolete free page hints, right?);

4) The guest starts reporting free pages:
     4.1) the host may actively write StopCMD to the Host2Guest register
before the guest finishes; or
     4.2) the guest finishes reporting and writes StopCMD to the
Guest2Host register, which traps to QEMU, to stop.
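
(A rough guest-side sketch of steps 3) and 4.1) above; the Host2Guest
register layout, the command values and the vb->* field names are
assumptions for illustration only:)

#define VIRTIO_BALLOON_CMD_START        1       /* assumed encoding */
#define VIRTIO_BALLOON_CMD_STOP         2

static void virtballoon_changed(struct virtio_device *vdev)
{
        struct virtio_balloon *vb = vdev->priv;
        unsigned int len;
        void *buf;
        u32 cmd;

        /* read the Host2Guest command register from the config space */
        virtio_cread(vdev, struct virtio_balloon_config, host2guest_cmd, &cmd);

        switch (cmd) {
        case VIRTIO_BALLOON_CMD_START:
                /* detach the used buffers: drop obsolete free page hints */
                while ((buf = virtqueue_get_buf(vb->free_page_vq, &len)))
                        ;
                queue_work(vb->balloon_wq, &vb->report_free_page_work);
                break;
        case VIRTIO_BALLOON_CMD_STOP:
                /* tell the reporting worker to stop early */
                WRITE_ONCE(vb->report_free_page_stop, true);
                break;
        }
}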


Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-10-11  6:03           ` Wei Wang
                               ` (2 preceding siblings ...)
  (?)
@ 2017-10-11 13:49             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-11 13:49 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Wed, Oct 11, 2017 at 02:03:20PM +0800, Wei Wang wrote:
> On 10/10/2017 11:15 PM, Michael S. Tsirkin wrote:
> > On Mon, Oct 02, 2017 at 04:38:01PM +0000, Wang, Wei W wrote:
> > > On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > > > > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > > > > +			  struct virtio_balloon_ctrlq_cmd *cmd,
> > > > > +			  bool inbuf)
> > > > > +{
> > > > > +	struct virtqueue *vq = vb->ctrl_vq;
> > > > > +
> > > > > +	ctrlq_add_cmd(vq, cmd, inbuf);
> > > > > +	if (!inbuf) {
> > > > > +		/*
> > > > > +		 * All the input cmd buffers are replenished here.
> > > > > +		 * This is necessary because the input cmd buffers are lost
> > > > > +		 * after live migration. The device needs to rewind all of
> > > > > +		 * them from the ctrl_vq.
> > > > Confused. Live migration somehow loses state? Why is that and why is it a good
> > > > idea? And how do you know this is migration even?
> > > > Looks like all you know is you got free page end. Could be any reason for this.
> > > 
> > > I think this would be something that the current live migration lacks - what the
> > > device read from the vq is not transferred during live migration, an example is the
> > > stat_vq_elem:
> > > Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c
> > This does not touch guest memory though it just manipulates
> > internal state to make it easier to migrate.
> > It's transparent to guest as migration should be.
> > 
> > > For all the things that are added to the vq and need to be held by the device
> > > to use later need to consider the situation that live migration might happen at any
> > > time and they need to be re-taken from the vq by the device on the destination
> > > machine.
> > > 
> > > So, even without this live migration optimization feature, I think all the things that are
> > > added to the vq for the device to hold, need a way for the device to rewind back from
> > > the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
> > > on the vq so that the device side rewinding can work.
> > > 
> > > Please let me know if anything is missed or if you have other suggestions.
> > IMO migration should pass enough data source to destination for
> > destination to continue where source left off without guest help.
> > 
> 
> I'm afraid it would be difficult to pass the entire VirtQueueElement to
> the destination. I think that would also be the reason that stats_vq_elem
> chose to rewind from the guest vq, which re-does the virtqueue_pop() -->
> virtqueue_map_desc() steps (the QEMU virtual address to guest physical
> address mapping may have changed on the destination).

Yes but note how that rewind does not involve modifying the ring.
It just rolls back some indices.


> 
> How about another direction which would be easier - using two 32-bit
> device-specific configuration registers, Host2Guest and Guest2Host
> command registers, to replace the ctrlq for command exchange:
> 
> The flow can be as follows:
> 
> 1) Before the host sends a StartCMD, it flushes the free_page_vq in case
> any old free page hint is left there;
> 
> 2) The host writes StartCMD to the Host2Guest register, and notifies the
> guest;
> 
> 3) Upon receiving a configuration notification, the guest reads the
> Host2Guest register and detaches all the used buffers from the
> free_page_vq (then for each StartCMD, the free_page_vq will always have
> no obsolete free page hints, right?);
> 
> 4) The guest starts reporting free pages:
>     4.1) the host may actively write StopCMD to the Host2Guest register
> before the guest finishes; or
>     4.2) the guest finishes reporting and writes StopCMD to the
> Guest2Host register, which traps to QEMU, to stop.
> 
> 
> Best,
> Wei

I am not sure it matters whether a VQ or the config is used to start/stop.
But I think flushing is very fragile. You will easily run into races
if one of the actors gets out of sync and keeps adding data.
I think adding an ID in the free vq stream is a more robust
approach.
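
(One way to read that suggestion, purely as an illustration - the header
layout and the vb->* field names are assumptions: the host hands out an ID
with each start request, the guest prefixes every free-page report with that
ID, and the device simply drops reports whose ID is stale, so no flush is
needed:)

struct virtio_balloon_free_page_hdr {
        __virtio32 id;          /* copied from the host's start request */
};

static int report_free_page_block(struct virtio_balloon *vb, u32 id,
                                  unsigned long pfn, unsigned int nr_pages)
{
        struct scatterlist sg[2];

        vb->free_page_hdr.id = cpu_to_virtio32(vb->vdev, id);
        sg_init_table(sg, 2);
        sg_set_buf(&sg[0], &vb->free_page_hdr, sizeof(vb->free_page_hdr));
        sg_set_buf(&sg[1], page_address(pfn_to_page(pfn)),
                   nr_pages << PAGE_SHIFT);

        /* the device ignores the payload when hdr.id does not match */
        return virtqueue_add_outbuf(vb->free_page_vq, sg, 2, vb, GFP_KERNEL);
}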

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
@ 2017-10-11 13:49             ` Michael S. Tsirkin
  0 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-11 13:49 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini@redhat.com

On Wed, Oct 11, 2017 at 02:03:20PM +0800, Wei Wang wrote:
> On 10/10/2017 11:15 PM, Michael S. Tsirkin wrote:
> > On Mon, Oct 02, 2017 at 04:38:01PM +0000, Wang, Wei W wrote:
> > > On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
> > > > > +static void ctrlq_send_cmd(struct virtio_balloon *vb,
> > > > > +			  struct virtio_balloon_ctrlq_cmd *cmd,
> > > > > +			  bool inbuf)
> > > > > +{
> > > > > +	struct virtqueue *vq = vb->ctrl_vq;
> > > > > +
> > > > > +	ctrlq_add_cmd(vq, cmd, inbuf);
> > > > > +	if (!inbuf) {
> > > > > +		/*
> > > > > +		 * All the input cmd buffers are replenished here.
> > > > > +		 * This is necessary because the input cmd buffers are lost
> > > > > +		 * after live migration. The device needs to rewind all of
> > > > > +		 * them from the ctrl_vq.
> > > > Confused. Live migration somehow loses state? Why is that and why is it a good
> > > > idea? And how do you know this is migration even?
> > > > Looks like all you know is you got free page end. Could be any reason for this.
> > > 
> > > I think this would be something that the current live migration lacks - what the
> > > device read from the vq is not transferred during live migration, an example is the
> > > stat_vq_elem:
> > > Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c
> > This does not touch guest memory though it just manipulates
> > internal state to make it easier to migrate.
> > It's transparent to guest as migration should be.
> > 
> > > For all the things that are added to the vq and need to be held by the device
> > > to use later need to consider the situation that live migration might happen at any
> > > time and they need to be re-taken from the vq by the device on the destination
> > > machine.
> > > 
> > > So, even without this live migration optimization feature, I think all the things that are
> > > added to the vq for the device to hold, need a way for the device to rewind back from
> > > the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
> > > on the vq so that the device side rewinding can work.
> > > 
> > > Please let me know if anything is missed or if you have other suggestions.
> > IMO migration should pass enough data source to destination for
> > destination to continue where source left off without guest help.
> > 
> 
> I'm afraid it would be difficult to pass the entire VirtQueueElement to the
> destination. I think that would also be the reason that stats_vq_elem chose
> to rewind from the guest vq, which re-does the virtqueue_pop() -->
> virtqueue_map_desc() steps (the QEMU virtual address to guest physical
> address relationship may change on the destination).

Yes but note how that rewind does not involve modifying the ring.
It just rolls back some indices.


> 
> How about another direction which would be easier - using two 32-bit device
> specific configuration registers,
> Host2Guest and Guest2Host command registers, to replace the ctrlq for
> command exchange:
> 
> The flow can be as follows:
> 
> 1) Before the host sends a StartCMD, it flushes the free_page_vq in case any
> old free page hint is left there;
> 
> 2) Host writes StartCMD to the Host2Guest register, and notifies the guest;
> 
> 3) Upon receiving a configuration notification, Guest reads the Host2Guest
> register, and detaches all the used buffers from free_page_vq;
> (then for each StartCMD, the free_page_vq will always have no obsolete free
> page hints, right? )
> 
> 4) Guest starts reporting free pages:
>     4.1) Host may actively write StopCMD to the Host2Guest register before
> the guest finishes; or
>     4.2) Guest finishes reporting and writes StopCMD to the Guest2Host register,
> which traps to QEMU, to stop.
> 
> 
> Best,
> Wei

I am not sure it matters whether a VQ or the config is used to start/stop.
But I think flushing is very fragile. You will easily run into races
if one of the actors gets out of sync and keeps adding data.
I think adding an ID in the free vq stream is a more robust
approach.
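
To make the ID-in-stream idea concrete, here is a minimal sketch; the entry
layout and all names below are illustrative assumptions, not something the
posted patches or the virtio spec define:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical format: every entry the guest adds to free_page_vq
 * starts with the ID of the reporting round it belongs to.
 */
struct free_page_hint {
	uint32_t id;       /* reporting round this hint belongs to */
	uint32_t order;    /* block covers 2^order pages */
	uint64_t gpa;      /* guest-physical address of the block */
};

/*
 * Device side: a hint whose ID does not match the current round is
 * simply dropped, so the host never needs to flush the vq when it
 * starts a new round.
 */
static bool hint_is_current(const struct free_page_hint *h, uint32_t cur_id)
{
	return h->id == cur_id;
}

With something like this, a slow or out-of-sync reporter cannot pollute the
current round: its stale entries are recognizable and ignored.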

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-10-11 13:49             ` Michael S. Tsirkin
                                 ` (2 preceding siblings ...)
  (?)
@ 2017-10-12  3:54               ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-12  3:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 10/11/2017 09:49 PM, Michael S. Tsirkin wrote:
> On Wed, Oct 11, 2017 at 02:03:20PM +0800, Wei Wang wrote:
>> On 10/10/2017 11:15 PM, Michael S. Tsirkin wrote:
>>> On Mon, Oct 02, 2017 at 04:38:01PM +0000, Wang, Wei W wrote:
>>>> On Sunday, October 1, 2017 11:19 AM, Michael S. Tsirkin wrote:
>>>>> On Sat, Sep 30, 2017 at 12:05:54PM +0800, Wei Wang wrote:
>>>>>> +static void ctrlq_send_cmd(struct virtio_balloon *vb,
>>>>>> +			  struct virtio_balloon_ctrlq_cmd *cmd,
>>>>>> +			  bool inbuf)
>>>>>> +{
>>>>>> +	struct virtqueue *vq = vb->ctrl_vq;
>>>>>> +
>>>>>> +	ctrlq_add_cmd(vq, cmd, inbuf);
>>>>>> +	if (!inbuf) {
>>>>>> +		/*
>>>>>> +		 * All the input cmd buffers are replenished here.
>>>>>> +		 * This is necessary because the input cmd buffers are lost
>>>>>> +		 * after live migration. The device needs to rewind all of
>>>>>> +		 * them from the ctrl_vq.
>>>>> Confused. Live migration somehow loses state? Why is that and why is it a good
>>>>> idea? And how do you know this is migration even?
>>>>> Looks like all you know is you got free page end. Could be any reason for this.
>>>> I think this would be something that the current live migration lacks - what the
>>>> device read from the vq is not transferred during live migration, an example is the
>>>> stat_vq_elem:
>>>> Line 476 at https://github.com/qemu/qemu/blob/master/hw/virtio/virtio-balloon.c
>>> This does not touch guest memory though it just manipulates
>>> internal state to make it easier to migrate.
>>> It's transparent to guest as migration should be.
>>>
>>>> For all the things that are added to the vq and need to be held by the device
>>>> to use later need to consider the situation that live migration might happen at any
>>>> time and they need to be re-taken from the vq by the device on the destination
>>>> machine.
>>>>
>>>> So, even without this live migration optimization feature, I think all the things that are
>>>> added to the vq for the device to hold, need a way for the device to rewind back from
>>>> the vq - re-adding all the elements to the vq is a trick to keep a record of all of them
>>>> on the vq so that the device side rewinding can work.
>>>>
>>>> Please let me know if anything is missed or if you have other suggestions.
>>> IMO migration should pass enough data source to destination for
>>> destination to continue where source left off without guest help.
>>>
>> I'm afraid it would be difficult to pass the entire VirtQueueElement to the
>> destination. I think that would also be the reason that stats_vq_elem chose
>> to rewind from the guest vq, which re-does the virtqueue_pop() -->
>> virtqueue_map_desc() steps (the QEMU virtual address to guest physical
>> address relationship may change on the destination).
> Yes but note how that rewind does not involve modifying the ring.
> It just rolls back some indices.

Yes, it rolls back the indices, and then the following
virtio_balloon_receive_stats() can re-pop the previous entry given by the guest.

Recall how stats_vq_elem works: there is only one stats buffer, which is
used by the guest to report stats and also used by the host to ask the
guest for a stats report.

So the host can roll back one previous entry, and what it gets will
always be the stat_vq_elem.
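
As a side note, here is a toy model of the index rollback being discussed.
This is not QEMU code; it only illustrates why the rewind never modifies the
guest-visible ring (as far as I recall, QEMU's virtqueue_unpop() does
essentially this bookkeeping on the device's private index):

#include <stdint.h>

/*
 * Toy model: the device remembers how many avail entries it has consumed
 * (last_avail_idx). Handing an element back is just decrementing that
 * counter, so the next pop returns the same element again; the ring memory
 * itself is never written.
 */
struct toy_vq {
	uint16_t last_avail_idx;   /* next avail entry the device will consume */
	uint16_t shadow_avail_idx; /* latest avail index published by the driver */
};

static int toy_vq_pop(struct toy_vq *vq)
{
	if (vq->last_avail_idx == vq->shadow_avail_idx)
		return -1;                  /* nothing available */
	return vq->last_avail_idx++;        /* index of the consumed element */
}

static void toy_vq_unpop(struct toy_vq *vq)
{
	vq->last_avail_idx--;               /* element becomes available again */
}

Only the device's private counter moves back, so after migration the
destination can pop the very same element again.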


Our case is a little more complex than that - we have both a free_page_cmd_in
buffer (for host-to-guest commands) and a free_page_cmd_out buffer (for
guest-to-host commands) passed via ctrl_vq. When the host rolls back one
entry, it may get the free_page_cmd_out buffer, which can't be used as the
host-to-guest buffer (i.e. the free_page_elem held by the device).

So a trick in the driver is to refill the free_page_cmd_in buffer every
time after the free_page_cmd_out buffer is sent to the host, so that when
the host rewinds one previous entry, it always gets the free_page_cmd_in
buffer (which may not be a very nice method).
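
For illustration, a simplified sketch of that refill trick using the
standard Linux virtqueue API; the bare u32 command buffers are placeholders
(the patch uses struct virtio_balloon_ctrlq_cmd), and error handling is
omitted:

#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

static u32 cmd_in;	/* host -> guest command; the device writes it */
static u32 cmd_out;	/* guest -> host command; the device reads it */

static void send_guest_cmd(struct virtqueue *ctrl_vq, u32 cmd)
{
	struct scatterlist sg;

	/* Guest -> host: the device will read this buffer. */
	cmd_out = cmd;
	sg_init_one(&sg, &cmd_out, sizeof(cmd_out));
	virtqueue_add_outbuf(ctrl_vq, &sg, 1, &cmd_out, GFP_KERNEL);

	/*
	 * Refill the host -> guest buffer every time a command is sent,
	 * so that when the device on the destination rewinds one entry
	 * from ctrl_vq it always finds an in-buffer, never our out-buffer.
	 */
	sg_init_one(&sg, &cmd_in, sizeof(cmd_in));
	virtqueue_add_inbuf(ctrl_vq, &sg, 1, &cmd_in, GFP_KERNEL);

	virtqueue_kick(ctrl_vq);
}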



>
>> How about another direction which would be easier - using two 32-bit device
>> specific configuration registers,
>> Host2Guest and Guest2Host command registers, to replace the ctrlq for
>> command exchange:
>>
>> The flow can be as follows:
>>
>> 1) Before the host sends a StartCMD, it flushes the free_page_vq in case any
>> old free page hint is left there;
>> 2) Host writes StartCMD to the Host2Guest register, and notifies the guest;
>>
>> 3) Upon receiving a configuration notification, Guest reads the Host2Guest
>> register, and detaches all the used buffers from free_page_vq;
>> (then for each StartCMD, the free_page_vq will always have no obsolete free
>> page hints, right? )
>>
>> 4) Guest starts reporting free pages:
>>      4.1) Host may actively write StopCMD to the Host2Guest register before
>> the guest finishes; or
>>      4.2) Guest finishes reporting and writes StopCMD to the Guest2Host register,
>> which traps to QEMU, to stop.
>>
>>
>> Best,
>> Wei
> I am not sure it matters whether a VQ or the config is used to start/stop.


It doesn't matter in terms of the flushing issue, but the config method
could avoid the above rewind issue.


> But I think flushing is very fragile. You will easily run into races
> if one of the actors gets out of sync and keeps adding data.
> I think adding an ID in the free vq stream is a more robust
> approach.
>

Adding an ID to the free vq would require the device to distinguish whether
it receives an ID or a free page hint, so an extra protocol is needed for
the two sides to talk. Currently, we directly assign the free page address
to desc->addr. With ID support, we would first need to allocate a buffer
for the protocol header, add the free page address to the header, and then
set desc->addr = &header.
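
Roughly, the driver side would then look like the sketch below; the header
layout is an assumption made up purely for illustration, and it shows the
extra cost: each hint now needs its own small allocation instead of pointing
desc->addr straight at the free block (freeing on completion is omitted):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/virtio.h>

/* Assumed per-hint header; not defined by the posted patches. */
struct free_page_hdr {
	__le32 id;	/* reporting round ID */
	__le32 order;	/* block covers 2^order pages */
	__le64 pfn;	/* first pfn of the free block */
};

static int report_free_block(struct virtqueue *free_page_vq, u32 id,
			     unsigned long pfn, unsigned int order)
{
	struct free_page_hdr *hdr;
	struct scatterlist sg;

	/* GFP_ATOMIC on the assumption that the mm callback may run
	 * with a zone lock held.
	 */
	hdr = kmalloc(sizeof(*hdr), GFP_ATOMIC);
	if (!hdr)
		return -ENOMEM;
	hdr->id = cpu_to_le32(id);
	hdr->order = cpu_to_le32(order);
	hdr->pfn = cpu_to_le64(pfn);

	sg_init_one(&sg, hdr, sizeof(*hdr));
	return virtqueue_add_outbuf(free_page_vq, &sg, 1, hdr, GFP_ATOMIC);
}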

How about putting the ID in the command path? This would avoid the above
trouble.

For example, using the 32-bit config registers (a packing sketch follows
below):
first 16 bits: Command field
second 16 bits: ID field
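
A possible packing, assuming the low half carries the command and the high
half carries the ID (which half is meant to be "first" is not specified
above, so this is only one choice, and the command values are made up):

#include <stdint.h>

#define FP_CMD_MASK	0x0000ffffu
#define FP_ID_SHIFT	16

#define FP_CMD_STOP	0u	/* illustrative command encodings */
#define FP_CMD_START	1u

static inline uint32_t fp_reg_pack(uint16_t cmd, uint16_t id)
{
	return (uint32_t)cmd | ((uint32_t)id << FP_ID_SHIFT);
}

static inline uint16_t fp_reg_cmd(uint32_t reg)
{
	return reg & FP_CMD_MASK;
}

static inline uint16_t fp_reg_id(uint32_t reg)
{
	return reg >> FP_ID_SHIFT;
}

For example, "Start, 1" from step 1) below would be written as
fp_reg_pack(FP_CMD_START, 1).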

Then, the working flow would look like this:

1) Host writes "Start, 1" to the Host2Guest register and notifies the guest;

2) Guest reads the Host2Guest register, and ACKs by writing "Start, 1" to
the Guest2Host register;

3) Guest starts reporting free pages;

4) Each time the host receives a free page hint from the free_page_vq, it
compares the ID fields of the Host2Guest and Guest2Host registers. If they
match, it filters the free page out of the migration dirty bitmap;
otherwise, it simply pushes the hint back without filtering (see the sketch
below).
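
And a rough sketch of the host side of step 4 under the same assumed
register layout; the dirty-bitmap helper is a hypothetical stub, and only
the ID comparison is the point:

#include <stdint.h>

/*
 * Hypothetical helper: in real code this would clear the block's pages
 * from the migration dirty bitmap so the first pass skips them.
 */
static void migration_skip_free_block(uint64_t gpa, uint64_t len)
{
	(void)gpa;
	(void)len;
}

static void host_handle_free_page_hint(uint32_t host2guest_reg,
				       uint32_t guest2host_reg,
				       uint64_t gpa, uint64_t len)
{
	uint16_t host_id = host2guest_reg >> 16;
	uint16_t guest_id = guest2host_reg >> 16;

	if (host_id == guest_id) {
		/* Hint belongs to the round we asked for: filter it out. */
		migration_skip_free_block(gpa, len);
	}
	/*
	 * Otherwise it is a stale hint from an earlier round; the buffer
	 * is simply handed back to the guest and the bitmap is untouched.
	 */
}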


Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-10-12  3:54               ` Wei Wang
                                   ` (2 preceding siblings ...)
  (?)
@ 2017-10-13 13:38                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 146+ messages in thread
From: Michael S. Tsirkin @ 2017-10-13 13:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On Thu, Oct 12, 2017 at 11:54:56AM +0800, Wei Wang wrote:
> > But I think flushing is very fragile. You will easily run into races
> > if one of the actors gets out of sync and keeps adding data.
> > I think adding an ID in the free vq stream is a more robust
> > approach.
> > 
> 
> Adding ID to the free vq would need the device to distinguish whether it
> receives an ID or a free page hint,

Not really.  It's pretty simple: a 64-bit buffer is an ID; a 4K or
bigger one is a page.
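
As a hypothetical illustration of this size-based rule, a QEMU-style
device handler could classify each element by its buffer length. The
sketch below assumes the guest adds both the round ID and the page
hints as device-writable (in) buffers; current_id, requested_id and
skip_free_page_in_bitmap() are assumed names standing in for the
device state and the dirty-bitmap filtering, none of them exist in
QEMU.

#include "qemu/osdep.h"
#include "qemu/iov.h"
#include "hw/virtio/virtio.h"

#define HINT_PAGE_SIZE 4096   /* smallest buffer treated as a page hint */

/* assumed state: last ID seen on the vq / ID of the round the host wants */
static uint64_t current_id, requested_id;

/* assumed helper, not an existing QEMU function */
void skip_free_page_in_bitmap(hwaddr gpa, size_t len);

/* registered as the free_page_vq handler at realize time (not shown) */
static void free_page_vq_handle(VirtIODevice *vdev, VirtQueue *vq)
{
    VirtQueueElement *elem;

    while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
        size_t len = iov_size(elem->in_sg, elem->in_num);

        if (len == sizeof(uint64_t)) {
            /* an 8-byte buffer carries the ID of a new reporting round */
            iov_to_buf(elem->in_sg, elem->in_num, 0,
                       &current_id, sizeof(current_id));
        } else if (len >= HINT_PAGE_SIZE && current_id == requested_id) {
            /* a page-sized buffer is a hint; in_addr[0] is its GPA */
            skip_free_page_in_bitmap(elem->in_addr[0], len);
        }

        virtqueue_push(vq, elem, 0);
        g_free(elem);
    }
    virtio_notify(vdev, vq);
}

Anything that is neither exactly 8 bytes nor at least a page would
simply be pushed back untouched, which keeps the rule unambiguous.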


> so an extra protocol is needed for the two sides to talk. Currently, we
> directly assign the free page
> address to desc->addr. With ID support, we would need to first allocate
> buffer for the protocol header,
> and add the free page address to the header, then desc->addr = &header.


I do not think you should add ID on each page. What would be the point?
Add it each time you detect a new start command.

> How about putting the ID to the command path? This would avoid the above
> trouble.
> 
> For example, using the 32-bit config registers:
> first 16-bit: Command field
> send 16-bit: ID field
> 
> Then, the working flow would look like this:
> 
> 1) Host writes "Start, 1" to the Host2Guest register and notify;
> 
> 2) Guest reads Host2Guest register, and ACKs by writing "Start, 1" to
> Guest2Host register;
> 
> 3) Guest starts report free pages;
> 
> 4) Each time when the host receives a free page hint from the free_page_vq,
> it compares the ID fields of
> the Host2Guest and Guest2Host register. If matching, then filter out the
> free page from the migration dirty bitmap,
> otherwise, simply push back without doing the filtering.
> 
> 
> Best,
> Wei


All fine, but config and vq ops are asynchronous: the host has no idea
when entries were added to the vq. So the ID sent to the host needs to
go through the vq. And I would make it a 64-bit, or at least 32-bit,
ID rather than a 16-bit one, to avoid wrap-around.
-- 
MST

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
  2017-10-13 13:38                 ` Michael S. Tsirkin
                                     ` (2 preceding siblings ...)
  (?)
@ 2017-10-19  8:07                   ` Wei Wang
  -1 siblings, 0 replies; 146+ messages in thread
From: Wei Wang @ 2017-10-19  8:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mhocko, akpm, mawilcox, david, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, willy, liliang.opensource,
	yang.zhang.wz, quan.xu

On 10/13/2017 09:38 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 12, 2017 at 11:54:56AM +0800, Wei Wang wrote:
>>> But I think flushing is very fragile. You will easily run into races
>>> if one of the actors gets out of sync and keeps adding data.
>>> I think adding an ID in the free vq stream is a more robust
>>> approach.
>>>
>> Adding ID to the free vq would need the device to distinguish whether it
>> receives an ID or a free page hint,
> Not really.  It's pretty simple: a 64 bit buffer is an ID. A 4K and bigger one
> is a page.

I think we can also use the previous method: free page hints via
in_buf, and the ID via out_buf.
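
A minimal driver-side sketch of that split follows, taking the
free_page_vq directly; the function names, and the requirement that
the caller pass a DMA-able id buffer, are illustrative assumptions,
not the code of the posted patch.

/*
 * Hypothetical driver-side sketch: the round ID goes out as a
 * device-readable (out) buffer, each free page hint as a
 * device-writable (in) buffer, so the device can tell them apart by
 * direction alone.
 */
#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* id must point to a DMA-able buffer (e.g. kmalloc'ed), not the stack */
static int report_free_page_round_id(struct virtqueue *free_page_vq,
                                     __le64 *id)
{
        struct scatterlist sg;
        int err;

        sg_init_one(&sg, id, sizeof(*id));
        err = virtqueue_add_outbuf(free_page_vq, &sg, 1, id, GFP_KERNEL);
        if (!err)
                virtqueue_kick(free_page_vq);
        return err;
}

static int report_free_page_hint(struct virtqueue *free_page_vq,
                                 void *addr, size_t size)
{
        struct scatterlist sg;
        int err;

        sg_init_one(&sg, addr, size);
        err = virtqueue_add_inbuf(free_page_vq, &sg, 1, addr, GFP_KERNEL);
        if (!err)
                virtqueue_kick(free_page_vq);
        return err;
}

Kicking on every buffer is kept only for brevity; a real driver would
batch the hints of one round and kick once, sending the ID first so
the device knows which round the following hints belong to.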

Best,
Wei

^ permalink raw reply	[flat|nested] 146+ messages in thread

end of thread, other threads:[~2017-10-19  8:07 UTC | newest]

Thread overview: 146+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-30  4:05 [PATCH v16 0/5] Virtio-balloon Enhancement Wei Wang
2017-09-30  4:05 ` [virtio-dev] " Wei Wang
2017-09-30  4:05 ` [Qemu-devel] " Wei Wang
2017-09-30  4:05 ` Wei Wang
2017-09-30  4:05 ` [PATCH v16 1/5] lib/xbitmap: Introduce xbitmap Wei Wang
2017-09-30  4:05   ` [virtio-dev] " Wei Wang
2017-09-30  4:05   ` [Qemu-devel] " Wei Wang
2017-09-30  4:05   ` Wei Wang
2017-09-30  4:05   ` Wei Wang
2017-10-09 11:30   ` Tetsuo Handa
2017-10-09 11:30     ` [Qemu-devel] " Tetsuo Handa
2017-10-09 11:30     ` Tetsuo Handa
2017-09-30  4:05 ` Wei Wang
2017-09-30  4:05 ` [PATCH v16 2/5] radix tree test suite: add tests for xbitmap Wei Wang
2017-09-30  4:05 ` Wei Wang
2017-09-30  4:05   ` [virtio-dev] " Wei Wang
2017-09-30  4:05   ` [Qemu-devel] " Wei Wang
2017-09-30  4:05   ` Wei Wang
2017-09-30  4:05 ` [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG Wei Wang
2017-09-30  4:05 ` Wei Wang
2017-09-30  4:05   ` [virtio-dev] " Wei Wang
2017-09-30  4:05   ` [Qemu-devel] " Wei Wang
2017-09-30  4:05   ` Wei Wang
2017-10-02  4:30   ` Michael S. Tsirkin
2017-10-02  4:30   ` Michael S. Tsirkin
2017-10-02  4:30     ` [Qemu-devel] " Michael S. Tsirkin
2017-10-02  4:30     ` Michael S. Tsirkin
2017-10-02 12:39     ` Wang, Wei W
2017-10-02 12:39     ` Wang, Wei W
2017-10-02 12:39       ` [Qemu-devel] " Wang, Wei W
2017-10-02 12:39       ` Wang, Wei W
2017-10-02 12:39       ` Wang, Wei W
2017-10-02 13:44       ` Michael S. Tsirkin
2017-10-02 13:44       ` Michael S. Tsirkin
2017-10-02 13:44         ` [Qemu-devel] " Michael S. Tsirkin
2017-10-02 13:44         ` Michael S. Tsirkin
2017-10-02 13:44         ` Michael S. Tsirkin
2017-10-09 15:20   ` Michael S. Tsirkin
2017-10-09 15:20     ` [virtio-dev] " Michael S. Tsirkin
2017-10-09 15:20     ` [Qemu-devel] " Michael S. Tsirkin
2017-10-09 15:20     ` Michael S. Tsirkin
2017-10-10  7:28     ` Wei Wang
2017-10-10  7:28     ` Wei Wang
2017-10-10  7:28       ` [virtio-dev] " Wei Wang
2017-10-10  7:28       ` [Qemu-devel] " Wei Wang
2017-10-10  7:28       ` Wei Wang
2017-10-10 11:08       ` Tetsuo Handa
2017-10-10 11:08         ` [Qemu-devel] " Tetsuo Handa
2017-10-10 11:08         ` Tetsuo Handa
2017-10-10 12:32         ` Wei Wang
2017-10-10 12:32           ` [virtio-dev] " Wei Wang
2017-10-10 12:32           ` [Qemu-devel] " Wei Wang
2017-10-10 12:32           ` Wei Wang
2017-10-10 13:09           ` Tetsuo Handa
2017-10-10 13:09             ` [Qemu-devel] " Tetsuo Handa
2017-10-10 13:09             ` Tetsuo Handa
2017-10-11  1:51             ` Wei Wang
2017-10-11  1:51             ` Wei Wang
2017-10-11  1:51               ` [virtio-dev] " Wei Wang
2017-10-11  1:51               ` [Qemu-devel] " Wei Wang
2017-10-11  1:51               ` Wei Wang
2017-10-11  2:26               ` Tetsuo Handa
2017-10-11  2:26                 ` [Qemu-devel] " Tetsuo Handa
2017-10-11  3:16                 ` Wei Wang
2017-10-11  3:16                   ` [virtio-dev] " Wei Wang
2017-10-11  3:16                   ` [Qemu-devel] " Wei Wang
2017-10-11  3:16                   ` Wei Wang
2017-10-11  3:16                 ` Wei Wang
2017-10-10 12:32         ` Wei Wang
2017-10-09 15:20   ` Michael S. Tsirkin
2017-09-30  4:05 ` [PATCH v16 4/5] mm: support reporting free page blocks Wei Wang
2017-09-30  4:05 ` Wei Wang
2017-09-30  4:05   ` [virtio-dev] " Wei Wang
2017-09-30  4:05   ` [Qemu-devel] " Wei Wang
2017-09-30  4:05   ` Wei Wang
2017-10-03 14:50   ` Michal Hocko
2017-10-03 14:50     ` [Qemu-devel] " Michal Hocko
2017-10-03 14:50     ` Michal Hocko
2017-10-03 14:50   ` Michal Hocko
2017-09-30  4:05 ` [PATCH v16 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ Wei Wang
2017-09-30  4:05   ` [virtio-dev] " Wei Wang
2017-09-30  4:05   ` [Qemu-devel] " Wei Wang
2017-09-30  4:05   ` Wei Wang
2017-10-01  3:18   ` Michael S. Tsirkin
2017-10-01  3:18     ` [virtio-dev] " Michael S. Tsirkin
2017-10-01  3:18     ` [Qemu-devel] " Michael S. Tsirkin
2017-10-01  3:18     ` Michael S. Tsirkin
2017-10-02 16:38     ` Wang, Wei W
2017-10-02 16:38     ` Wang, Wei W
2017-10-02 16:38       ` [virtio-dev] " Wang, Wei W
2017-10-02 16:38       ` [Qemu-devel] " Wang, Wei W
2017-10-02 16:38       ` Wang, Wei W
2017-10-02 16:38       ` Wang, Wei W
2017-10-10 15:15       ` Michael S. Tsirkin
2017-10-10 15:15         ` [virtio-dev] " Michael S. Tsirkin
2017-10-10 15:15         ` [Qemu-devel] " Michael S. Tsirkin
2017-10-10 15:15         ` Michael S. Tsirkin
2017-10-10 15:15         ` Michael S. Tsirkin
2017-10-11  6:03         ` Wei Wang
2017-10-11  6:03         ` Wei Wang
2017-10-11  6:03           ` [virtio-dev] " Wei Wang
2017-10-11  6:03           ` [Qemu-devel] " Wei Wang
2017-10-11  6:03           ` Wei Wang
2017-10-11  6:03           ` Wei Wang
2017-10-11 13:49           ` Michael S. Tsirkin
2017-10-11 13:49           ` Michael S. Tsirkin
2017-10-11 13:49             ` [virtio-dev] " Michael S. Tsirkin
2017-10-11 13:49             ` [Qemu-devel] " Michael S. Tsirkin
2017-10-11 13:49             ` Michael S. Tsirkin
2017-10-11 13:49             ` Michael S. Tsirkin
2017-10-12  3:54             ` Wei Wang
2017-10-12  3:54               ` [virtio-dev] " Wei Wang
2017-10-12  3:54               ` [Qemu-devel] " Wei Wang
2017-10-12  3:54               ` Wei Wang
2017-10-12  3:54               ` Wei Wang
2017-10-13 13:38               ` Michael S. Tsirkin
2017-10-13 13:38                 ` [virtio-dev] " Michael S. Tsirkin
2017-10-13 13:38                 ` [Qemu-devel] " Michael S. Tsirkin
2017-10-13 13:38                 ` Michael S. Tsirkin
2017-10-13 13:38                 ` Michael S. Tsirkin
2017-10-19  8:07                 ` Wei Wang
2017-10-19  8:07                 ` Wei Wang
2017-10-19  8:07                   ` [virtio-dev] " Wei Wang
2017-10-19  8:07                   ` [Qemu-devel] " Wei Wang
2017-10-19  8:07                   ` Wei Wang
2017-10-19  8:07                   ` Wei Wang
2017-10-13 13:38               ` Michael S. Tsirkin
2017-10-12  3:54             ` Wei Wang
2017-10-10 15:15       ` Michael S. Tsirkin
2017-10-01  3:18   ` Michael S. Tsirkin
2017-09-30  4:05 ` Wei Wang
2017-10-01 13:16 ` [PATCH v16 0/5] Virtio-balloon Enhancement Damian Tometzki
2017-10-01 13:16 ` Damian Tometzki
2017-10-01 13:16   ` [Qemu-devel] " Damian Tometzki
2017-10-01 13:16   ` Damian Tometzki
2017-10-01 13:16   ` Damian Tometzki
2017-10-01 13:25 ` Damian Tometzki
2017-10-01 13:25   ` [Qemu-devel] " Damian Tometzki
2017-10-01 13:25   ` Damian Tometzki
2017-10-01 13:25   ` Damian Tometzki
2017-10-09  9:39   ` Wei Wang
2017-10-09  9:39     ` [virtio-dev] " Wei Wang
2017-10-09  9:39     ` [Qemu-devel] " Wei Wang
2017-10-09  9:39     ` Wei Wang
2017-10-09  9:39   ` Wei Wang
2017-10-01 13:25 ` Damian Tometzki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.