* [PATCH v11 0/6] Virtio-balloon Enhancement
@ 2017-06-09 10:41 ` Wei Wang
  0 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

This patch series enhances the existing virtio-balloon with the following new
features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks, instead of one by one; and
2) cmdq: a new virtqueue used to send commands between the device and driver.
Currently, it supports commands to report memory stats (replacing the old statq
mechanism) and to report guest unused pages.

Liang Li (1):
  virtio-balloon: deflate via a page list

Wei Wang (5):
  virtio-balloon: coding format cleanup
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  mm: function to offer a page block on the free list
  mm: export symbol of next_zone and first_online_pgdat
  virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ

 drivers/virtio/virtio_balloon.c     | 781 ++++++++++++++++++++++++++++++++----
 drivers/virtio/virtio_ring.c        | 120 +++++-
 include/linux/mm.h                  |   5 +
 include/linux/virtio.h              |   7 +
 include/uapi/linux/virtio_balloon.h |  14 +
 include/uapi/linux/virtio_ring.h    |   3 +
 mm/mmzone.c                         |   2 +
 mm/page_alloc.c                     |  91 +++++
 8 files changed, 950 insertions(+), 73 deletions(-)

-- 
2.7.4
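
As a rough illustration of the chunk-based transfer in feature 1) above (not
part of the series): a run of guest-physically contiguous pages that would
previously need one PFN entry per 4KB page can instead be described by a
single (base address, size) pair. A minimal standalone sketch of that
arithmetic, assuming 4KB pages and a hypothetical 2MB run:

#include <stdio.h>
#include <stdint.h>

#define BALLOON_PFN_SHIFT   12          /* 4KB balloon page size */
#define RUN_START_PFN       0x100000UL  /* hypothetical first pfn of the run */
#define RUN_NR_PAGES        512UL       /* hypothetical run length: 512 * 4KB = 2MB */

int main(void)
{
	/* Old scheme: one PFN entry per 4KB page -> 512 entries for this run. */
	unsigned long pfn_entries = RUN_NR_PAGES;

	/* Chunk scheme: one (base address, size) pair covers the whole run. */
	uint64_t chunk_base = (uint64_t)RUN_START_PFN << BALLOON_PFN_SHIFT;
	uint32_t chunk_size = RUN_NR_PAGES;	/* in 4KB balloon pages */

	printf("per-PFN entries: %lu, chunk descriptors: 1\n", pfn_entries);
	printf("chunk base GPA: 0x%llx, size: %u pages\n",
	       (unsigned long long)chunk_base, chunk_size);
	return 0;
}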

* [PATCH v11 1/6] virtio-balloon: deflate via a page list
  2017-06-09 10:41 ` Wei Wang
@ 2017-06-09 10:41   ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

From: Liang Li <liang.z.li@intel.com>

This patch saves the deflated pages to a list, instead of the PFN array.
Accordingly, the balloon_pfn_to_page() function is removed.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 34adf9b..4a9f307 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -104,12 +104,6 @@ static u32 page_to_balloon_pfn(struct page *page)
 	return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-	BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-	return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
 	struct virtio_balloon *vb = vq->vdev->priv;
@@ -182,18 +176,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+				 struct list_head *pages)
 {
-	unsigned int i;
-	struct page *page;
+	struct page *page, *next;
 
-	/* Find pfns pointing at start of each page, get pages and free them. */
-	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-							   vb->pfns[i]));
+	list_for_each_entry_safe(page, next, pages, lru) {
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
+		list_del(&page->lru);
 		put_page(page); /* balloon reference */
 	}
 }
@@ -203,6 +195,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	unsigned num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -216,6 +209,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -227,7 +221,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
-- 
2.7.4
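
The deflate path above unlinks pages from the list while walking it, which is
why the patch uses list_for_each_entry_safe() rather than
list_for_each_entry(). A minimal kernel-context sketch of that pattern, with
hypothetical names (the balloon driver itself walks struct page objects linked
through page->lru):

#include <linux/list.h>
#include <linux/slab.h>

struct demo_page {
	struct list_head lru;
};

static void demo_release_all(struct list_head *pages)
{
	struct demo_page *p, *next;

	/*
	 * The _safe variant caches the next entry up front, so the current
	 * entry can be unlinked and freed without corrupting the walk.
	 */
	list_for_each_entry_safe(p, next, pages, lru) {
		list_del(&p->lru);
		kfree(p);
	}
}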

* [PATCH v11 2/6] virtio-balloon: coding format cleanup
  2017-06-09 10:41 ` Wei Wang
@ 2017-06-09 10:41   ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Clean up the comment format.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4a9f307..ecb64e9 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -132,8 +132,10 @@ static void set_page_pfns(struct virtio_balloon *vb,
 {
 	unsigned int i;
 
-	/* Set balloon pfns pointing at this page.
-	 * Note that the first pfn points at start of the page. */
+	/*
+	 * Set balloon pfns pointing at this page.
+	 * Note that the first pfn points at start of the page.
+	 */
 	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
 		pfns[i] = cpu_to_virtio32(vb->vdev,
 					  page_to_balloon_pfn(page) + i);
-- 
2.7.4

* [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-09 10:41 ` Wei Wang
@ 2017-06-09 10:41   ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
the transfer of the ballooned (i.e. inflated/deflated) pages in
chunks to the host.

The previous virtio-balloon implementation is not very efficient,
because the ballooned pages are transferred to the host one by one.
Here is the percentage breakdown of the time spent on each step of
the balloon inflating process (inflating 7GB of an 8GB idle guest):

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of guest physically contiguous pages.
When the pages are packed into a chunk, they are converted into
balloon-page-size (4KB) pages. A chunk is offered to the host
via a base address (i.e. the starting guest physical address of those
physically contiguous pages) and a size (i.e. the total number of
4KB balloon-size pages). In the implementation, a chunk is described
by a vring_desc struct.

By doing so, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~590ms,
an improvement of ~85%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 418 +++++++++++++++++++++++++++++++++---
 drivers/virtio/virtio_ring.c        | 120 ++++++++++-
 include/linux/virtio.h              |   7 +
 include/uapi/linux/virtio_balloon.h |   1 +
 include/uapi/linux/virtio_ring.h    |   3 +
 5 files changed, 517 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index ecb64e9..0cf945c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -51,6 +51,36 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* The size of one page_bmap used to record inflated/deflated pages. */
+#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
+/*
+ * Calculates how many pfns a page_bmap can record. A bit corresponds to a
+ * page of PAGE_SIZE.
+ */
+#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
+	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
+
+/* The number of page_bmap to allocate by default. */
+#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1
+/* The maximum number of page_bmap that can be allocated. */
+#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
+
+/*
+ * The QEMU virtio implementation requires the desc table size to be less
+ * than VIRTQUEUE_MAX_SIZE, so subtract 1 here.
+ */
+#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)
+
+/* The struct to manage ballooned pages in chunks */
+struct virtio_balloon_page_chunk {
+	/* Indirect desc table to hold chunks of balloon pages */
+	struct vring_desc *desc_table;
+	/* Number of added chunks of balloon pages */
+	unsigned int chunk_num;
+	/* Bitmap used to record ballooned pages. */
+	unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -79,6 +109,8 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	struct virtio_balloon_page_chunk balloon_page_chunk;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -111,6 +143,133 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
+/* Update pfn_max and pfn_min according to the pfn of page */
+static inline void update_pfn_range(struct virtio_balloon *vb,
+				    struct page *page,
+				    unsigned long *pfn_min,
+				    unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+}
+
+static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
+					  unsigned long pfn_num)
+{
+	unsigned int i, bmap_num, allocated_bmap_num;
+	unsigned long bmap_len;
+
+	allocated_bmap_num = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM;
+	bmap_len = ALIGN(pfn_num, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = roundup(bmap_len, VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+	/*
+	 * VIRTIO_BALLOON_PAGE_BMAP_SIZE is the size of one page_bmap, so
+	 * divide it to calculate how many page_bmap that we need.
+	 */
+	bmap_num = (unsigned int)(bmap_len / VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+	/* The number of page_bmap to allocate should not exceed the max */
+	bmap_num = min_t(unsigned int, VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM,
+			 bmap_num);
+
+	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < bmap_num; i++) {
+		vb->balloon_page_chunk.page_bmap[i] =
+			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->balloon_page_chunk.page_bmap[i])
+			allocated_bmap_num++;
+		else
+			break;
+	}
+
+	return allocated_bmap_num;
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb,
+				    unsigned int page_bmap_num)
+{
+	unsigned int i;
+
+	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < page_bmap_num;
+	     i++) {
+		kfree(vb->balloon_page_chunk.page_bmap[i]);
+		vb->balloon_page_chunk.page_bmap[i] = NULL;
+		page_bmap_num--;
+	}
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb,
+			    unsigned int page_bmap_num)
+{
+	int i;
+
+	for (i = 0; i < page_bmap_num; i++)
+		memset(vb->balloon_page_chunk.page_bmap[i], 0,
+		       VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+}
+
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	unsigned int len, num;
+	struct vring_desc *desc = vb->balloon_page_chunk.desc_table;
+
+	num = vb->balloon_page_chunk.chunk_num;
+	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		vb->balloon_page_chunk.chunk_num = 0;
+	}
+}
+
+/* Add a chunk to the buffer. */
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  u64 base_addr, u32 size)
+{
+	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
+	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
+
+	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
+	desc->len = cpu_to_virtio32(vb->vdev, size);
+	*num += 1;
+	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
+		send_page_chunks(vb, vq);
+}
+
+static void convert_bmap_to_chunks(struct virtio_balloon *vb,
+				   struct virtqueue *vq,
+				   unsigned long *bmap,
+				   unsigned long pfn_start,
+				   unsigned long size)
+{
+	unsigned long next_one, next_zero, pos = 0;
+	u64 chunk_base_addr;
+	u32 chunk_size;
+
+	while (pos < size) {
+		next_one = find_next_bit(bmap, size, pos);
+		/*
+		 * No "1" bit found, which means that there is no pfn
+		 * recorded in the rest of this bmap.
+		 */
+		if (next_one == size)
+			break;
+		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
+		/*
+		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
+		 * Convert it to be pages of 4KB balloon page size when
+		 * adding it to a chunk.
+		 */
+		chunk_size = (next_zero - next_one) *
+			     VIRTIO_BALLOON_PAGES_PER_PAGE;
+		chunk_base_addr = (pfn_start + next_one) <<
+				  VIRTIO_BALLOON_PFN_SHIFT;
+		if (chunk_size) {
+			add_one_chunk(vb, vq, chunk_base_addr, chunk_size);
+			pos += next_zero + 1;
+		}
+	}
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
 	struct scatterlist sg;
@@ -124,7 +283,35 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 	/* When host has read buffer, this completes via balloon_ack */
 	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+}
+
+static void tell_host_from_page_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long pfn_end,
+				     unsigned int page_bmap_num)
+{
+	unsigned long i, pfn_num;
 
+	for (i = 0; i < page_bmap_num; i++) {
+		/*
+		 * For the last page_bmap, only the remaining number of pfns
+		 * need to be searched rather than the entire page_bmap.
+		 */
+		if (i + 1 == page_bmap_num)
+			pfn_num = (pfn_end - pfn_start) %
+				  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+		else
+			pfn_num = VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+
+		convert_bmap_to_chunks(vb, vq,
+				       vb->balloon_page_chunk.page_bmap[i],
+				       pfn_start +
+				       i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP,
+				       pfn_num);
+	}
+	if (vb->balloon_page_chunk.chunk_num > 0)
+		send_page_chunks(vb, vq);
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -141,13 +328,89 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+/*
+ * Send ballooned pages in chunks to host.
+ * The ballooned pages are recorded in page bitmaps. Each bit in a bitmap
+ * corresponds to a page of PAGE_SIZE. The page bitmaps are searched for
+ * continuous "1" bits, which correspond to continuous pages, to chunk.
+ * When packing those continuous pages into chunks, pages are converted into
+ * 4KB balloon pages.
+ *
+ * pfn_max and pfn_min form the range of pfns that need to use page bitmaps to
+ * record. If the range is too large to be recorded into the allocated page
+ * bitmaps, the page bitmaps are used multiple times to record the entire
+ * range of pfns.
+ */
+static void tell_host_page_chunks(struct virtio_balloon *vb,
+				  struct list_head *pages,
+				  struct virtqueue *vq,
+				  unsigned long pfn_max,
+				  unsigned long pfn_min)
+{
+	/*
+	 * The pfn_start and pfn_end form the range of pfns that the allocated
+	 * page_bmap can record in each round.
+	 */
+	unsigned long pfn_start, pfn_end;
+	/* Total number of allocated page_bmap */
+	unsigned int page_bmap_num;
+	struct page *page;
+	bool found;
+
+	/*
+	 * In the case that one page_bmap is not sufficient to record the pfn
+	 * range, page_bmap will be extended by allocating more numbers of
+	 * page_bmap.
+	 */
+	page_bmap_num = extend_page_bmap_size(vb, pfn_max - pfn_min + 1);
+
+	/* Start from the beginning of the whole pfn range */
+	pfn_start = pfn_min;
+	while (pfn_start < pfn_max) {
+		pfn_end = pfn_start +
+			  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP * page_bmap_num;
+		pfn_end = pfn_end < pfn_max ? pfn_end : pfn_max;
+		clear_page_bmap(vb, page_bmap_num);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, this_pfn;
+
+			this_pfn = page_to_pfn(page);
+			if (this_pfn < pfn_start || this_pfn > pfn_end)
+				continue;
+			bmap_idx = (this_pfn - pfn_start) /
+				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+			bmap_pos = (this_pfn - pfn_start) %
+				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos,
+				vb->balloon_page_chunk.page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found)
+			tell_host_from_page_bmap(vb, vq, pfn_start, pfn_end,
+						 page_bmap_num);
+		/*
+		 * Start the next round when pfn_start and pfn_end couldn't
+		 * cover the whole pfn range given by pfn_max and pfn_min.
+		 */
+		pfn_start = pfn_end;
+	}
+	free_extended_page_bmap(vb, page_bmap_num);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!chunking)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +425,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +437,14 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			tell_host_page_chunks(vb, &vb_dev_info->pages,
+					      vb->inflate_vq,
+					      pfn_max, pfn_min);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +470,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!chunking)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +486,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +500,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			tell_host_page_chunks(vb, &pages, vb->deflate_vq,
+					      pfn_max, pfn_min);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -442,6 +726,14 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
+		      VIRTIO_BALLOON_PAGES_PER_PAGE);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -465,6 +757,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
 	unsigned long flags;
 
 	/*
@@ -486,16 +780,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -522,9 +822,78 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
+		kfree(vb->balloon_page_chunk.page_bmap[i]);
+		vb->balloon_page_chunk.page_bmap[i] = NULL;
+	}
+}
+
+static int balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	int i;
+
+	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
+						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
+						GFP_KERNEL);
+	if (!vb->balloon_page_chunk.desc_table)
+		goto err_page_chunk;
+	vb->balloon_page_chunk.chunk_num = 0;
+
+	/*
+	 * The default number of page_bmaps are allocated. More may be
+	 * allocated on demand.
+	 */
+	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
+		vb->balloon_page_chunk.page_bmap[i] =
+			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (!vb->balloon_page_chunk.page_bmap[i])
+			goto err_page_bmap;
+	}
+
+	return 0;
+err_page_bmap:
+	free_page_bmap(vb);
+	kfree(vb->balloon_page_chunk.desc_table);
+	vb->balloon_page_chunk.desc_table = NULL;
+err_page_chunk:
+	__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	return -ENOMEM;
+}
+
+static int virtballoon_validate(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = NULL;
+	int err;
+
+	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
+	if (!vb) {
+		err = -ENOMEM;
+		goto err_vb;
+	}
+	vb->vdev = vdev;
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
+		err = balloon_page_chunk_init(vb);
+		if (err < 0)
+			goto err_page_chunk;
+	}
+
+	return 0;
+
+err_page_chunk:
+	kfree(vb);
+err_vb:
+	return err;
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
-	struct virtio_balloon *vb;
+	struct virtio_balloon *vb = vdev->priv;
 	int err;
 
 	if (!vdev->config->get) {
@@ -533,20 +902,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		return -EINVAL;
 	}
 
-	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
-	if (!vb) {
-		err = -ENOMEM;
-		goto out;
-	}
-
 	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
 	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
-	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
 
@@ -590,7 +953,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
-out:
 	return err;
 }
 
@@ -620,6 +982,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	free_page_bmap(vb);
+	kfree(vb->balloon_page_chunk.desc_table);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
@@ -664,6 +1028,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_PAGE_CHUNKS,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
@@ -674,6 +1039,7 @@ static struct virtio_driver virtio_balloon_driver = {
 	.id_table =	id_table,
 	.probe =	virtballoon_probe,
 	.remove =	virtballoon_remove,
+	.validate =	virtballoon_validate,
 	.config_changed = virtballoon_changed,
 #ifdef CONFIG_PM_SLEEP
 	.freeze	=	virtballoon_freeze,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 409aeaa..0ea2512 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -235,8 +235,17 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
 	return dma_mapping_error(vring_dma_dev(vq), addr);
 }
 
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-					 unsigned int total_sg, gfp_t gfp)
+/**
+ * alloc_indirect - allocate an indirect desc table
+ * @vdev: the virtio_device that owns the indirect desc table.
+ * @num: the number of entries that the table will have.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Return NULL if the table allocation failed. Otherwise, return the address
+ * of the table.
+ */
+struct vring_desc *alloc_indirect(struct virtio_device *vdev, unsigned int num,
+				  gfp_t gfp)
 {
 	struct vring_desc *desc;
 	unsigned int i;
@@ -248,14 +257,15 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 	 */
 	gfp &= ~__GFP_HIGHMEM;
 
-	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
+	desc = kmalloc_array(num, sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return NULL;
 
-	for (i = 0; i < total_sg; i++)
-		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
+	for (i = 0; i < num; i++)
+		desc[i].next = cpu_to_virtio16(vdev, i + 1);
 	return desc;
 }
+EXPORT_SYMBOL_GPL(alloc_indirect);
 
 static inline int virtqueue_add(struct virtqueue *_vq,
 				struct scatterlist *sgs[],
@@ -302,7 +312,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
-		desc = alloc_indirect(_vq, total_sg, gfp);
+		desc = alloc_indirect(_vq->vdev, total_sg, gfp);
 	else
 		desc = NULL;
 
@@ -433,6 +443,104 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 }
 
 /**
+ * virtqueue_indirect_desc_table_add - add an indirect desc table to the vq
+ * @_vq: the struct virtqueue we're talking about.
+ * @desc: the desc table we're talking about.
+ * @num: the number of entries that the desc table has.
+ *
+ * Returns zero or a negative error (ie. ENOSPC, EIO).
+ */
+int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
+				      struct vring_desc *desc,
+				      unsigned int num)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	dma_addr_t desc_addr;
+	unsigned int i, avail;
+	int head;
+
+	/* Sanity check */
+	if (!desc) {
+		pr_debug("%s: empty desc table\n", __func__);
+		return -EINVAL;
+	}
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+	if (!vq->vq.num_free) {
+		pr_debug("%s: the virtioqueue is full\n", __func__);
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
+	/* Map and fill in the indirect table */
+	desc_addr = vring_map_single(vq, desc, num * sizeof(struct vring_desc),
+				     DMA_TO_DEVICE);
+	if (vring_mapping_error(vq, desc_addr)) {
+		pr_debug("%s: map desc failed\n", __func__);
+		END_USE(vq);
+		return -EIO;
+	}
+
+	/* Mark the flag of the table entries */
+	for (i = 0; i < num; i++)
+		desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT);
+	/* The last one doesn't continue. */
+	desc[num - 1].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+
+	/* Get a ring entry to point to the indirect table */
+	head = vq->free_head;
+	vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev,
+						     VRING_DESC_F_INDIRECT);
+	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, desc_addr);
+	vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, num *
+						   sizeof(struct vring_desc));
+	/* We're using 1 buffer from the free list. */
+	vq->vq.num_free--;
+	/* Update free pointer */
+	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next);
+
+	/* Store token and indirect buffer state. */
+	vq->desc_state[head].data = desc;
+	/* Don't free the caller allocated indirect table when detach_buf. */
+	vq->desc_state[head].indir_desc = NULL;
+
+	/*
+	 * Put entry in available array (but don't update avail->idx until they
+	 * do sync).
+	 */
+	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
+	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
+
+	/*
+	 * Descriptors and available array need to be set before we expose the
+	 * new available array entries.
+	 */
+	virtio_wmb(vq->weak_barriers);
+	vq->avail_idx_shadow++;
+	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+	vq->num_added++;
+
+	pr_debug("%s: added buffer head %i to %p\n", __func__, head, vq);
+	END_USE(vq);
+
+	/*
+	 * This is very unlikely, but theoretically possible.  Kick
+	 * just in case.
+	 */
+	if (unlikely(vq->num_added == (1 << 16) - 1))
+		virtqueue_kick(_vq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_indirect_desc_table_add);
+
+/**
  * virtqueue_add_sgs - expose buffers to other end
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 7edfbdb..01dad22 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -34,6 +34,13 @@ struct virtqueue {
 	void *priv;
 };
 
+struct vring_desc *alloc_indirect(struct virtio_device *vdev,
+				  unsigned int num, gfp_t gfp);
+
+int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
+				      struct vring_desc *desc,
+				      unsigned int num);
+
 int virtqueue_add_outbuf(struct virtqueue *vq,
 			 struct scatterlist sg[], unsigned int num,
 			 void *data,
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..5ed3c7b 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_CHUNKS	3 /* Inflate/Deflate pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index c072959..0499fb8 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -111,6 +111,9 @@ struct vring {
 #define VRING_USED_ALIGN_SIZE 4
 #define VRING_DESC_ALIGN_SIZE 16
 
+/* The supported max queue size */
+#define VIRTQUEUE_MAX_SIZE 1024
+
 /* The standard layout for the ring is a continuous chunk of memory which looks
  * like this.  We assume num is a power of 2.
  *
-- 
2.7.4
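
The core of the conversion above is a scan of the page bitmap for runs of set
bits: each run becomes one chunk whose base address is
(pfn_start + run_start) << VIRTIO_BALLOON_PFN_SHIFT and whose length is the
run length in 4KB balloon pages. A standalone userspace sketch of that scan,
with a hand-rolled bit search standing in for the kernel's find_next_bit() and
find_next_zero_bit(), and with assumed example values:

#include <stdio.h>
#include <stdint.h>

#define PFN_SHIFT       12	/* 4KB balloon page, as in the patch */
#define PAGES_PER_PAGE  1	/* assumes PAGE_SIZE == 4KB */

/* Return the index of the first bit with value 'val' at or after 'pos'. */
static unsigned int next_bit(uint64_t bmap, unsigned int size,
			     unsigned int pos, int val)
{
	while (pos < size && (int)((bmap >> pos) & 1) != val)
		pos++;
	return pos;
}

int main(void)
{
	/* Hypothetical 64-pfn bitmap: pfn offsets 3-6 and 10-12 are ballooned. */
	uint64_t bmap = (0xFULL << 3) | (0x7ULL << 10);
	unsigned int size = 64, pos = 0;
	unsigned long pfn_start = 0x100000;	/* assumed base pfn of this bitmap */

	while (pos < size) {
		unsigned int one = next_bit(bmap, size, pos, 1);
		if (one == size)
			break;	/* no more ballooned pages in this bitmap */
		unsigned int zero = next_bit(bmap, size, one + 1, 0);
		uint64_t base = (uint64_t)(pfn_start + one) << PFN_SHIFT;
		uint32_t len = (zero - one) * PAGES_PER_PAGE;

		printf("chunk: base=0x%llx, len=%u pages\n",
		       (unsigned long long)base, len);
		pos = zero + 1;
	}
	return 0;
}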

^ permalink raw reply related	[flat|nested] 175+ messages in thread

* [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
@ 2017-06-09 10:41   ` Wei Wang
  0 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
the transfer of the ballooned (i.e. inflated/deflated) pages in
chunks to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the ballooned pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are stage 2)
and stage 4).

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of guest physically continuous pages.
When the pages are packed into a chunk, they are converted into
balloon page size (4KB) pages. A chunk is offered to the host
via a base address (i.e. the start guest physical address of those
physically continuous pages) and the size (i.e. the total number
of the 4KB balloon size pages). A chunk is described via a
vring_desc struct in the implementation.

By doing so, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~590ms
resulting in an improvement of ~85%.

TODO: optimize stage 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 418 +++++++++++++++++++++++++++++++++---
 drivers/virtio/virtio_ring.c        | 120 ++++++++++-
 include/linux/virtio.h              |   7 +
 include/uapi/linux/virtio_balloon.h |   1 +
 include/uapi/linux/virtio_ring.h    |   3 +
 5 files changed, 517 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index ecb64e9..0cf945c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -51,6 +51,36 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* The size of one page_bmap used to record inflated/deflated pages. */
+#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
+/*
+ * Callulates how many pfns can a page_bmap record. A bit corresponds to a
+ * page of PAGE_SIZE.
+ */
+#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
+	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
+
+/* The number of page_bmap to allocate by default. */
+#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1
+/* The maximum number of page_bmap that can be allocated. */
+#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
+
+/*
+ * QEMU virtio implementation requires the desc table size less than
+ * VIRTQUEUE_MAX_SIZE, so minus 1 here.
+ */
+#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)
+
+/* The struct to manage ballooned pages in chunks */
+struct virtio_balloon_page_chunk {
+	/* Indirect desc table to hold chunks of balloon pages */
+	struct vring_desc *desc_table;
+	/* Number of added chunks of balloon pages */
+	unsigned int chunk_num;
+	/* Bitmap used to record ballooned pages. */
+	unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -79,6 +109,8 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	struct virtio_balloon_page_chunk balloon_page_chunk;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -111,6 +143,133 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
+/* Update pfn_max and pfn_min according to the pfn of page */
+static inline void update_pfn_range(struct virtio_balloon *vb,
+				    struct page *page,
+				    unsigned long *pfn_min,
+				    unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+}
+
+static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
+					  unsigned long pfn_num)
+{
+	unsigned int i, bmap_num, allocated_bmap_num;
+	unsigned long bmap_len;
+
+	allocated_bmap_num = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM;
+	bmap_len = ALIGN(pfn_num, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = roundup(bmap_len, VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+	/*
+	 * VIRTIO_BALLOON_PAGE_BMAP_SIZE is the size of one page_bmap, so
+	 * divide it to calculate how many page_bmap that we need.
+	 */
+	bmap_num = (unsigned int)(bmap_len / VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+	/* The number of page_bmap to allocate should not exceed the max */
+	bmap_num = min_t(unsigned int, VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM,
+			 bmap_num);
+
+	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < bmap_num; i++) {
+		vb->balloon_page_chunk.page_bmap[i] =
+			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->balloon_page_chunk.page_bmap[i])
+			allocated_bmap_num++;
+		else
+			break;
+	}
+
+	return allocated_bmap_num;
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb,
+				    unsigned int page_bmap_num)
+{
+	unsigned int i;
+
+	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < page_bmap_num;
+	     i++) {
+		kfree(vb->balloon_page_chunk.page_bmap[i]);
+		vb->balloon_page_chunk.page_bmap[i] = NULL;
+		page_bmap_num--;
+	}
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb,
+			    unsigned int page_bmap_num)
+{
+	int i;
+
+	for (i = 0; i < page_bmap_num; i++)
+		memset(vb->balloon_page_chunk.page_bmap[i], 0,
+		       VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+}
+
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	unsigned int len, num;
+	struct vring_desc *desc = vb->balloon_page_chunk.desc_table;
+
+	num = vb->balloon_page_chunk.chunk_num;
+	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		vb->balloon_page_chunk.chunk_num = 0;
+	}
+}
+
+/* Add a chunk to the buffer. */
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  u64 base_addr, u32 size)
+{
+	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
+	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
+
+	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
+	desc->len = cpu_to_virtio32(vb->vdev, size);
+	*num += 1;
+	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
+		send_page_chunks(vb, vq);
+}
+
+static void convert_bmap_to_chunks(struct virtio_balloon *vb,
+				   struct virtqueue *vq,
+				   unsigned long *bmap,
+				   unsigned long pfn_start,
+				   unsigned long size)
+{
+	unsigned long next_one, next_zero, pos = 0;
+	u64 chunk_base_addr;
+	u32 chunk_size;
+
+	while (pos < size) {
+		next_one = find_next_bit(bmap, size, pos);
+		/*
+		 * No "1" bit found, which means that there is no pfn
+		 * recorded in the rest of this bmap.
+		 */
+		if (next_one == size)
+			break;
+		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
+		/*
+		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
+		 * Convert it to be pages of 4KB balloon page size when
+		 * adding it to a chunk.
+		 */
+		chunk_size = (next_zero - next_one) *
+			     VIRTIO_BALLOON_PAGES_PER_PAGE;
+		chunk_base_addr = (pfn_start + next_one) <<
+				  VIRTIO_BALLOON_PFN_SHIFT;
+		if (chunk_size) {
+			add_one_chunk(vb, vq, chunk_base_addr, chunk_size);
+			pos = next_zero + 1;
+		}
+	}
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
 	struct scatterlist sg;
@@ -124,7 +283,35 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 	/* When host has read buffer, this completes via balloon_ack */
 	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+}
+
+static void tell_host_from_page_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long pfn_end,
+				     unsigned int page_bmap_num)
+{
+	unsigned long i, pfn_num;
 
+	for (i = 0; i < page_bmap_num; i++) {
+		/*
+		 * For the last page_bmap, only the remaining number of pfns
+		 * need to be searched rather than the entire page_bmap.
+		 */
+		if (i + 1 == page_bmap_num)
+			pfn_num = pfn_end - pfn_start -
+				  i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+		else
+			pfn_num = VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+
+		convert_bmap_to_chunks(vb, vq,
+				       vb->balloon_page_chunk.page_bmap[i],
+				       pfn_start +
+				       i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP,
+				       pfn_num);
+	}
+	if (vb->balloon_page_chunk.chunk_num > 0)
+		send_page_chunks(vb, vq);
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -141,13 +328,89 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+/*
+ * Send ballooned pages in chunks to host.
+ * The ballooned pages are recorded in page bitmaps. Each bit in a bitmap
+ * corresponds to a page of PAGE_SIZE. The page bitmaps are searched for
+ * continuous "1" bits, which correspond to contiguous pages, and packed
+ * into chunks. When packing those contiguous pages into chunks, they are
+ * converted into 4KB balloon pages.
+ *
+ * pfn_max and pfn_min form the range of pfns that need to use page bitmaps to
+ * record. If the range is too large to be recorded into the allocated page
+ * bitmaps, the page bitmaps are used multiple times to record the entire
+ * range of pfns.
+ */
+static void tell_host_page_chunks(struct virtio_balloon *vb,
+				  struct list_head *pages,
+				  struct virtqueue *vq,
+				  unsigned long pfn_max,
+				  unsigned long pfn_min)
+{
+	/*
+	 * The pfn_start and pfn_end form the range of pfns that the allocated
+	 * page_bmap can record in each round.
+	 */
+	unsigned long pfn_start, pfn_end;
+	/* Total number of allocated page_bmap */
+	unsigned int page_bmap_num;
+	struct page *page;
+	bool found;
+
+	/*
+	 * If one page_bmap is not sufficient to record the pfn range,
+	 * more page_bmaps are allocated on demand, up to
+	 * VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM.
+	 */
+	page_bmap_num = extend_page_bmap_size(vb, pfn_max - pfn_min + 1);
+
+	/* Start from the beginning of the whole pfn range */
+	pfn_start = pfn_min;
+	while (pfn_start < pfn_max) {
+		pfn_end = pfn_start +
+			  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP * page_bmap_num;
+		pfn_end = pfn_end < pfn_max ? pfn_end : pfn_max;
+		clear_page_bmap(vb, page_bmap_num);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, this_pfn;
+
+			this_pfn = page_to_pfn(page);
+			if (this_pfn < pfn_start || this_pfn > pfn_end)
+				continue;
+			bmap_idx = (this_pfn - pfn_start) /
+				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+			bmap_pos = (this_pfn - pfn_start) %
+				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos,
+				vb->balloon_page_chunk.page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found)
+			tell_host_from_page_bmap(vb, vq, pfn_start, pfn_end,
+						 page_bmap_num);
+		/*
+		 * Start the next round when pfn_start and pfn_end couldn't
+		 * cover the whole pfn range given by pfn_max and pfn_min.
+		 */
+		pfn_start = pfn_end;
+	}
+	free_extended_page_bmap(vb, page_bmap_num);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!chunking)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +425,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +437,14 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			tell_host_page_chunks(vb, &vb_dev_info->pages,
+					      vb->inflate_vq,
+					      pfn_max, pfn_min);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +470,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Without page chunking, we can only do one array worth at a time. */
+	if (!chunking)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +486,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +500,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			tell_host_page_chunks(vb, &pages, vb->deflate_vq,
+					      pfn_max, pfn_min);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -442,6 +726,14 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
+		      VIRTIO_BALLOON_PAGES_PER_PAGE);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -465,6 +757,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
 	unsigned long flags;
 
 	/*
@@ -486,16 +780,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -522,9 +822,78 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
+		kfree(vb->balloon_page_chunk.page_bmap[i]);
+		vb->balloon_page_chunk.page_bmap[i] = NULL;
+	}
+}
+
+static int balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	int i;
+
+	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
+						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
+						GFP_KERNEL);
+	if (!vb->balloon_page_chunk.desc_table)
+		goto err_page_chunk;
+	vb->balloon_page_chunk.chunk_num = 0;
+
+	/*
+	 * The default number of page_bmaps are allocated. More may be
+	 * allocated on demand.
+	 */
+	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
+		vb->balloon_page_chunk.page_bmap[i] =
+			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (!vb->balloon_page_chunk.page_bmap[i])
+			goto err_page_bmap;
+	}
+
+	return 0;
+err_page_bmap:
+	free_page_bmap(vb);
+	kfree(vb->balloon_page_chunk.desc_table);
+	vb->balloon_page_chunk.desc_table = NULL;
+err_page_chunk:
+	__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	return -ENOMEM;
+}
+
+static int virtballoon_validate(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = NULL;
+	int err;
+
+	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
+	if (!vb) {
+		err = -ENOMEM;
+		goto err_vb;
+	}
+	vb->vdev = vdev;
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
+		err = balloon_page_chunk_init(vb);
+		if (err < 0)
+			goto err_page_chunk;
+	}
+
+	return 0;
+
+err_page_chunk:
+	kfree(vb);
+err_vb:
+	return err;
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
-	struct virtio_balloon *vb;
+	struct virtio_balloon *vb = vdev->priv;
 	int err;
 
 	if (!vdev->config->get) {
@@ -533,20 +902,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		return -EINVAL;
 	}
 
-	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
-	if (!vb) {
-		err = -ENOMEM;
-		goto out;
-	}
-
 	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
 	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
-	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
 
@@ -590,7 +953,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
-out:
 	return err;
 }
 
@@ -620,6 +982,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	free_page_bmap(vb);
+	kfree(vb->balloon_page_chunk.desc_table);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
@@ -664,6 +1028,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_PAGE_CHUNKS,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
@@ -674,6 +1039,7 @@ static struct virtio_driver virtio_balloon_driver = {
 	.id_table =	id_table,
 	.probe =	virtballoon_probe,
 	.remove =	virtballoon_remove,
+	.validate =	virtballoon_validate,
 	.config_changed = virtballoon_changed,
 #ifdef CONFIG_PM_SLEEP
 	.freeze	=	virtballoon_freeze,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 409aeaa..0ea2512 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -235,8 +235,17 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
 	return dma_mapping_error(vring_dma_dev(vq), addr);
 }
 
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-					 unsigned int total_sg, gfp_t gfp)
+/**
+ * alloc_indirect - allocate an indirect desc table
+ * @vdev: the virtio_device that owns the indirect desc table.
+ * @num: the number of entries that the table will have.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Return NULL if the table allocation failed. Otherwise, return the address
+ * of the table.
+ */
+struct vring_desc *alloc_indirect(struct virtio_device *vdev, unsigned int num,
+				  gfp_t gfp)
 {
 	struct vring_desc *desc;
 	unsigned int i;
@@ -248,14 +257,15 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 	 */
 	gfp &= ~__GFP_HIGHMEM;
 
-	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
+	desc = kmalloc_array(num, sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return NULL;
 
-	for (i = 0; i < total_sg; i++)
-		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
+	for (i = 0; i < num; i++)
+		desc[i].next = cpu_to_virtio16(vdev, i + 1);
 	return desc;
 }
+EXPORT_SYMBOL_GPL(alloc_indirect);
 
 static inline int virtqueue_add(struct virtqueue *_vq,
 				struct scatterlist *sgs[],
@@ -302,7 +312,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
-		desc = alloc_indirect(_vq, total_sg, gfp);
+		desc = alloc_indirect(_vq->vdev, total_sg, gfp);
 	else
 		desc = NULL;
 
@@ -433,6 +443,104 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 }
 
 /**
+ * virtqueue_indirect_desc_table_add - add an indirect desc table to the vq
+ * @_vq: the struct virtqueue we're talking about.
+ * @desc: the desc table we're talking about.
+ * @num: the number of entries that the desc table has.
+ *
+ * Returns zero or a negative error (ie. ENOSPC, EIO).
+ */
+int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
+				      struct vring_desc *desc,
+				      unsigned int num)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	dma_addr_t desc_addr;
+	unsigned int i, avail;
+	int head;
+
+	/* Sanity check */
+	if (!desc) {
+		pr_debug("%s: empty desc table\n", __func__);
+		return -EINVAL;
+	}
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+	if (!vq->vq.num_free) {
+		pr_debug("%s: the virtqueue is full\n", __func__);
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
+	/* Map and fill in the indirect table */
+	desc_addr = vring_map_single(vq, desc, num * sizeof(struct vring_desc),
+				     DMA_TO_DEVICE);
+	if (vring_mapping_error(vq, desc_addr)) {
+		pr_debug("%s: map desc failed\n", __func__);
+		END_USE(vq);
+		return -EIO;
+	}
+
+	/* Mark the flag of the table entries */
+	for (i = 0; i < num; i++)
+		desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT);
+	/* The last one doesn't continue. */
+	desc[num - 1].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+
+	/* Get a ring entry to point to the indirect table */
+	head = vq->free_head;
+	vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev,
+						     VRING_DESC_F_INDIRECT);
+	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, desc_addr);
+	vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, num *
+						   sizeof(struct vring_desc));
+	/* We're using 1 buffer from the free list. */
+	vq->vq.num_free--;
+	/* Update free pointer */
+	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next);
+
+	/* Store token and indirect buffer state. */
+	vq->desc_state[head].data = desc;
+	/* Don't free the caller allocated indirect table when detach_buf. */
+	vq->desc_state[head].indir_desc = NULL;
+
+	/*
+	 * Put entry in available array (but don't update avail->idx until they
+	 * do sync).
+	 */
+	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
+	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
+
+	/*
+	 * Descriptors and available array need to be set before we expose the
+	 * new available array entries.
+	 */
+	virtio_wmb(vq->weak_barriers);
+	vq->avail_idx_shadow++;
+	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+	vq->num_added++;
+
+	pr_debug("%s: added buffer head %i to %p\n", __func__, head, vq);
+	END_USE(vq);
+
+	/*
+	 * This is very unlikely, but theoretically possible.  Kick
+	 * just in case.
+	 */
+	if (unlikely(vq->num_added == (1 << 16) - 1))
+		virtqueue_kick(_vq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_indirect_desc_table_add);
+
+/**
  * virtqueue_add_sgs - expose buffers to other end
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 7edfbdb..01dad22 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -34,6 +34,13 @@ struct virtqueue {
 	void *priv;
 };
 
+struct vring_desc *alloc_indirect(struct virtio_device *vdev,
+				  unsigned int num, gfp_t gfp);
+
+int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
+				      struct vring_desc *desc,
+				      unsigned int num);
+
 int virtqueue_add_outbuf(struct virtqueue *vq,
 			 struct scatterlist sg[], unsigned int num,
 			 void *data,
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..5ed3c7b 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_CHUNKS	3 /* Inflate/Deflate pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index c072959..0499fb8 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -111,6 +111,9 @@ struct vring {
 #define VRING_USED_ALIGN_SIZE 4
 #define VRING_DESC_ALIGN_SIZE 16
 
+/* The supported max queue size */
+#define VIRTQUEUE_MAX_SIZE 1024
+
 /* The standard layout for the ring is a continuous chunk of memory which looks
  * like this.  We assume num is a power of 2.
  *
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 175+ messages in thread

* [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-09 10:41 ` Wei Wang
                   ` (6 preceding siblings ...)
  (?)
@ 2017-06-09 10:41 ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
the transfer of the ballooned (i.e. inflated/deflated) pages in
chunks to the host.

The previous virtio-balloon implementation is not very efficient,
because the ballooned pages are transferred to the host one by one.
Here is the percentage breakdown of the time spent on each step of
the balloon inflation process (inflating 7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are steps 2) and 4).

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of guest physically contiguous pages.
When the pages are packed into a chunk, they are converted into
balloon page size (4KB) pages. A chunk is offered to the host
via a base address (i.e. the start guest physical address of those
physically contiguous pages) and a size (i.e. the total number of
4KB balloon-size pages). In the implementation, a chunk is described
by a vring_desc struct.
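
As a rough illustration of this encoding (not part of the patch), the
sketch below packs a run of guest-contiguous pages into one
(base address, size) chunk. The chunk_desc struct and pack_chunk()
helper are made-up names for the example; the driver itself reuses
struct vring_desc and goes through cpu_to_virtio64()/cpu_to_virtio32(),
which are omitted here.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT		12	/* assume 4KB guest pages */
#define BALLOON_PFN_SHIFT	12	/* VIRTIO_BALLOON_PFN_SHIFT */
#define PAGES_PER_PAGE		(1 << (PAGE_SHIFT - BALLOON_PFN_SHIFT))

/* Stand-in for the addr/len fields of the vring_desc used per chunk. */
struct chunk_desc {
	uint64_t addr;	/* guest physical base address of the chunk */
	uint32_t len;	/* number of 4KB balloon-size pages in the chunk */
};

/* Pack npages contiguous pages starting at pfn_start into one chunk. */
static void pack_chunk(struct chunk_desc *desc, uint64_t pfn_start,
		       uint64_t npages)
{
	desc->addr = pfn_start << BALLOON_PFN_SHIFT;
	desc->len = (uint32_t)(npages * PAGES_PER_PAGE);
}

int main(void)
{
	struct chunk_desc desc;

	/* A run of 512 contiguous pages starting at pfn 0x12345. */
	pack_chunk(&desc, 0x12345, 512);
	printf("chunk: base=0x%llx, 4KB pages=%u\n",
	       (unsigned long long)desc.addr, desc.len);
	return 0;
}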

By doing so, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~590ms,
an improvement of ~85%.
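For reference, the improvement figure follows directly from the two
timings above: (4126 - 590) / 4126 ~= 0.857, i.e. the inflate time for
this workload drops by roughly 85%.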

TODO: optimize step 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 418 +++++++++++++++++++++++++++++++++---
 drivers/virtio/virtio_ring.c        | 120 ++++++++++-
 include/linux/virtio.h              |   7 +
 include/uapi/linux/virtio_balloon.h |   1 +
 include/uapi/linux/virtio_ring.h    |   3 +
 5 files changed, 517 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index ecb64e9..0cf945c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -51,6 +51,36 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* The size of one page_bmap used to record inflated/deflated pages. */
+#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
+/*
+ * Calculates how many pfns a page_bmap can record. A bit corresponds to a
+ * page of PAGE_SIZE.
+ */
+#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
+	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
+
+/* The number of page_bmap to allocate by default. */
+#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1
+/* The maximum number of page_bmap that can be allocated. */
+#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
+
+/*
+ * The QEMU virtio implementation requires the desc table size to be less
+ * than VIRTQUEUE_MAX_SIZE, so subtract 1 here.
+ */
+#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)
+
+/* The struct to manage ballooned pages in chunks */
+struct virtio_balloon_page_chunk {
+	/* Indirect desc table to hold chunks of balloon pages */
+	struct vring_desc *desc_table;
+	/* Number of added chunks of balloon pages */
+	unsigned int chunk_num;
+	/* Bitmap used to record ballooned pages. */
+	unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -79,6 +109,8 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	struct virtio_balloon_page_chunk balloon_page_chunk;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -111,6 +143,133 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
+/* Update pfn_max and pfn_min according to the pfn of page */
+static inline void update_pfn_range(struct virtio_balloon *vb,
+				    struct page *page,
+				    unsigned long *pfn_min,
+				    unsigned long *pfn_max)
+{
+	unsigned long pfn = page_to_pfn(page);
+
+	*pfn_min = min(pfn, *pfn_min);
+	*pfn_max = max(pfn, *pfn_max);
+}
+
+static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
+					  unsigned long pfn_num)
+{
+	unsigned int i, bmap_num, allocated_bmap_num;
+	unsigned long bmap_len;
+
+	allocated_bmap_num = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM;
+	bmap_len = ALIGN(pfn_num, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = roundup(bmap_len, VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+	/*
+	 * VIRTIO_BALLOON_PAGE_BMAP_SIZE is the size of one page_bmap, so
+	 * divide by it to calculate how many page_bmaps we need.
+	 */
+	bmap_num = (unsigned int)(bmap_len / VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+	/* The number of page_bmap to allocate should not exceed the max */
+	bmap_num = min_t(unsigned int, VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM,
+			 bmap_num);
+
+	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < bmap_num; i++) {
+		vb->balloon_page_chunk.page_bmap[i] =
+			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->balloon_page_chunk.page_bmap[i])
+			allocated_bmap_num++;
+		else
+			break;
+	}
+
+	return allocated_bmap_num;
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb,
+				    unsigned int page_bmap_num)
+{
+	unsigned int i;
+
+	/* Only the extra page_bmaps beyond the default ones are freed. */
+	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM;
+	     i < page_bmap_num; i++) {
+		kfree(vb->balloon_page_chunk.page_bmap[i]);
+		vb->balloon_page_chunk.page_bmap[i] = NULL;
+	}
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb,
+			    unsigned int page_bmap_num)
+{
+	int i;
+
+	for (i = 0; i < page_bmap_num; i++)
+		memset(vb->balloon_page_chunk.page_bmap[i], 0,
+		       VIRTIO_BALLOON_PAGE_BMAP_SIZE);
+}
+
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	unsigned int len, num;
+	struct vring_desc *desc = vb->balloon_page_chunk.desc_table;
+
+	num = vb->balloon_page_chunk.chunk_num;
+	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
+		virtqueue_kick(vq);
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		vb->balloon_page_chunk.chunk_num = 0;
+	}
+}
+
+/* Add a chunk to the buffer. */
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  u64 base_addr, u32 size)
+{
+	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
+	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
+
+	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
+	desc->len = cpu_to_virtio32(vb->vdev, size);
+	*num += 1;
+	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
+		send_page_chunks(vb, vq);
+}
+
+static void convert_bmap_to_chunks(struct virtio_balloon *vb,
+				   struct virtqueue *vq,
+				   unsigned long *bmap,
+				   unsigned long pfn_start,
+				   unsigned long size)
+{
+	unsigned long next_one, next_zero, pos = 0;
+	u64 chunk_base_addr;
+	u32 chunk_size;
+
+	while (pos < size) {
+		next_one = find_next_bit(bmap, size, pos);
+		/*
+		 * No "1" bit found, which means that there is no pfn
+		 * recorded in the rest of this bmap.
+		 */
+		if (next_one == size)
+			break;
+		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
+		/*
+		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
+		 * Convert it to be pages of 4KB balloon page size when
+		 * adding it to a chunk.
+		 */
+		chunk_size = (next_zero - next_one) *
+			     VIRTIO_BALLOON_PAGES_PER_PAGE;
+		chunk_base_addr = (pfn_start + next_one) <<
+				  VIRTIO_BALLOON_PFN_SHIFT;
+		if (chunk_size) {
+			add_one_chunk(vb, vq, chunk_base_addr, chunk_size);
+			pos = next_zero + 1;
+		}
+	}
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
 	struct scatterlist sg;
@@ -124,7 +283,35 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 	/* When host has read buffer, this completes via balloon_ack */
 	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+}
+
+static void tell_host_from_page_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long pfn_end,
+				     unsigned int page_bmap_num)
+{
+	unsigned long i, pfn_num;
 
+	for (i = 0; i < page_bmap_num; i++) {
+		/*
+		 * For the last page_bmap, only the remaining number of pfns
+		 * need to be searched rather than the entire page_bmap.
+		 */
+		if (i + 1 == page_bmap_num)
+			pfn_num = pfn_end - pfn_start -
+				  i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+		else
+			pfn_num = VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+
+		convert_bmap_to_chunks(vb, vq,
+				       vb->balloon_page_chunk.page_bmap[i],
+				       pfn_start +
+				       i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP,
+				       pfn_num);
+	}
+	if (vb->balloon_page_chunk.chunk_num > 0)
+		send_page_chunks(vb, vq);
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -141,13 +328,89 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
+/*
+ * Send ballooned pages to the host in chunks.
+ * The ballooned pages are recorded in page bitmaps, where each bit
+ * corresponds to a page of PAGE_SIZE. The bitmaps are scanned for runs of
+ * contiguous "1" bits, which correspond to contiguous pages; each run is
+ * packed into one chunk, with its size expressed in 4KB balloon pages.
+ *
+ * pfn_min and pfn_max bound the range of pfns that need to be recorded. If
+ * the range is too large to fit into the allocated page bitmaps, the bitmaps
+ * are reused over successive sub-ranges until the entire range has been
+ * covered.
+ */
+static void tell_host_page_chunks(struct virtio_balloon *vb,
+				  struct list_head *pages,
+				  struct virtqueue *vq,
+				  unsigned long pfn_max,
+				  unsigned long pfn_min)
+{
+	/*
+	 * pfn_start and pfn_end form the range of pfns that the allocated
+	 * page_bmaps can record in each round.
+	 */
+	unsigned long pfn_start, pfn_end;
+	/* Total number of allocated page_bmap */
+	unsigned int page_bmap_num;
+	struct page *page;
+	bool found;
+
+	/*
+	 * If one page_bmap is not sufficient to record the pfn range, extend
+	 * it by allocating more page_bmaps.
+	 */
+	page_bmap_num = extend_page_bmap_size(vb, pfn_max - pfn_min + 1);
+
+	/* Start from the beginning of the whole pfn range */
+	pfn_start = pfn_min;
+	while (pfn_start < pfn_max) {
+		pfn_end = pfn_start +
+			  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP * page_bmap_num;
+		pfn_end = pfn_end < pfn_max ? pfn_end : pfn_max;
+		clear_page_bmap(vb, page_bmap_num);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, this_pfn;
+
+			this_pfn = page_to_pfn(page);
+			if (this_pfn < pfn_start || this_pfn > pfn_end)
+				continue;
+			bmap_idx = (this_pfn - pfn_start) /
+				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+			bmap_pos = (this_pfn - pfn_start) %
+				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos,
+				vb->balloon_page_chunk.page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found)
+			tell_host_from_page_bmap(vb, vq, pfn_start, pfn_end,
+						 page_bmap_num);
+		/*
+		 * Start the next round if the current pfn_start..pfn_end
+		 * window does not yet cover the whole range given by pfn_min
+		 * and pfn_max.
+		 */
+		pfn_start = pfn_end;
+	}
+	free_extended_page_bmap(vb, page_bmap_num);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (!chunking)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -162,7 +425,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -171,8 +437,14 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			tell_host_page_chunks(vb, &vb_dev_info->pages,
+					      vb->inflate_vq,
+					      pfn_max, pfn_min);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -198,9 +470,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	/* Traditionally, we can only do one array worth at a time. */
+	if (!chunking)
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -210,7 +486,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_pfn_range(vb, page, &pfn_min, &pfn_max);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -221,8 +500,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			tell_host_page_chunks(vb, &pages, vb->deflate_vq,
+					      pfn_max, pfn_min);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -442,6 +726,14 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
+		      VIRTIO_BALLOON_PAGES_PER_PAGE);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -465,6 +757,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
 	unsigned long flags;
 
 	/*
@@ -486,16 +780,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -522,9 +822,78 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
+		kfree(vb->balloon_page_chunk.page_bmap[i]);
+		vb->balloon_page_chunk.page_bmap[i] = NULL;
+	}
+}
+
+static int balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	int i;
+
+	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
+						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
+						GFP_KERNEL);
+	if (!vb->balloon_page_chunk.desc_table)
+		goto err_page_chunk;
+	vb->balloon_page_chunk.chunk_num = 0;
+
+	/*
+	 * The default number of page_bmaps is allocated here. More may be
+	 * allocated on demand.
+	 */
+	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
+		vb->balloon_page_chunk.page_bmap[i] =
+			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (!vb->balloon_page_chunk.page_bmap[i])
+			goto err_page_bmap;
+	}
+
+	return 0;
+err_page_bmap:
+	free_page_bmap(vb);
+	kfree(vb->balloon_page_chunk.desc_table);
+	vb->balloon_page_chunk.desc_table = NULL;
+err_page_chunk:
+	__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS);
+	dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	return -ENOMEM;
+}
+
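+/*
+ * Allocate the buffers needed by the offered features. If an allocation
+ * fails, the corresponding feature bit is cleared before the feature
+ * negotiation with the device completes.
+ */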
+static int virtballoon_validate(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = NULL;
+	int err;
+
+	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
+	if (!vb) {
+		err = -ENOMEM;
+		goto err_vb;
+	}
+	vb->vdev = vdev;
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
+		err = balloon_page_chunk_init(vb);
+		if (err < 0)
+			goto err_page_chunk;
+	}
+
+	return 0;
+
+err_page_chunk:
+	kfree(vb);
+err_vb:
+	return err;
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
-	struct virtio_balloon *vb;
+	struct virtio_balloon *vb = vdev->priv;
 	int err;
 
 	if (!vdev->config->get) {
@@ -533,20 +902,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		return -EINVAL;
 	}
 
-	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
-	if (!vb) {
-		err = -ENOMEM;
-		goto out;
-	}
-
 	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
 	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
-	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
 
@@ -590,7 +953,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
-out:
 	return err;
 }
 
@@ -620,6 +982,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	free_page_bmap(vb);
+	kfree(vb->balloon_page_chunk.desc_table);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
@@ -664,6 +1028,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_PAGE_CHUNKS,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
@@ -674,6 +1039,7 @@ static struct virtio_driver virtio_balloon_driver = {
 	.id_table =	id_table,
 	.probe =	virtballoon_probe,
 	.remove =	virtballoon_remove,
+	.validate =	virtballoon_validate,
 	.config_changed = virtballoon_changed,
 #ifdef CONFIG_PM_SLEEP
 	.freeze	=	virtballoon_freeze,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 409aeaa..0ea2512 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -235,8 +235,17 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
 	return dma_mapping_error(vring_dma_dev(vq), addr);
 }
 
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-					 unsigned int total_sg, gfp_t gfp)
+/**
+ * alloc_indirect - allocate an indirect desc table
+ * @vdev: the virtio_device that owns the indirect desc table.
+ * @num: the number of entries that the table will have.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Return NULL if the table allocation failed. Otherwise, return the address
+ * of the table.
+ */
+struct vring_desc *alloc_indirect(struct virtio_device *vdev, unsigned int num,
+				  gfp_t gfp)
 {
 	struct vring_desc *desc;
 	unsigned int i;
@@ -248,14 +257,15 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 	 */
 	gfp &= ~__GFP_HIGHMEM;
 
-	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
+	desc = kmalloc_array(num, sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return NULL;
 
-	for (i = 0; i < total_sg; i++)
-		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
+	for (i = 0; i < num; i++)
+		desc[i].next = cpu_to_virtio16(vdev, i + 1);
 	return desc;
 }
+EXPORT_SYMBOL_GPL(alloc_indirect);
 
 static inline int virtqueue_add(struct virtqueue *_vq,
 				struct scatterlist *sgs[],
@@ -302,7 +312,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
-		desc = alloc_indirect(_vq, total_sg, gfp);
+		desc = alloc_indirect(_vq->vdev, total_sg, gfp);
 	else
 		desc = NULL;
 
@@ -433,6 +443,104 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 }
 
 /**
+ * virtqueue_indirect_desc_table_add - add an indirect desc table to the vq
+ * @_vq: the struct virtqueue we're talking about.
+ * @desc: the desc table we're talking about.
+ * @num: the number of entries that the desc table has.
+ *
+ * Returns zero or a negative error (i.e. -EINVAL, -ENOSPC, -EIO).
+ */
+int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
+				      struct vring_desc *desc,
+				      unsigned int num)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	dma_addr_t desc_addr;
+	unsigned int i, avail;
+	int head;
+
+	/* Sanity check */
+	if (!desc) {
+		pr_debug("%s: empty desc table\n", __func__);
+		return -EINVAL;
+	}
+
+	START_USE(vq);
+
+	if (unlikely(vq->broken)) {
+		END_USE(vq);
+		return -EIO;
+	}
+
+	if (!vq->vq.num_free) {
+		pr_debug("%s: the virtioqueue is full\n", __func__);
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
+	/* Map and fill in the indirect table */
+	desc_addr = vring_map_single(vq, desc, num * sizeof(struct vring_desc),
+				     DMA_TO_DEVICE);
+	if (vring_mapping_error(vq, desc_addr)) {
+		pr_debug("%s: map desc failed\n", __func__);
+		END_USE(vq);
+		return -EIO;
+	}
+
+	/* Chain the table entries together with the NEXT flag */
+	for (i = 0; i < num; i++)
+		desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT);
+	/* The last one doesn't continue. */
+	desc[num - 1].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
+
+	/* Get a ring entry to point to the indirect table */
+	head = vq->free_head;
+	vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev,
+						     VRING_DESC_F_INDIRECT);
+	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, desc_addr);
+	vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, num *
+						   sizeof(struct vring_desc));
+	/* We're using 1 buffer from the free list. */
+	vq->vq.num_free--;
+	/* Update free pointer */
+	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next);
+
+	/* Store token and indirect buffer state. */
+	vq->desc_state[head].data = desc;
+	/* Don't free the caller allocated indirect table when detach_buf. */
+	vq->desc_state[head].indir_desc = NULL;
+
+	/*
+	 * Put entry in available array (but don't update avail->idx until they
+	 * do sync).
+	 */
+	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
+	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
+
+	/*
+	 * Descriptors and available array need to be set before we expose the
+	 * new available array entries.
+	 */
+	virtio_wmb(vq->weak_barriers);
+	vq->avail_idx_shadow++;
+	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
+	vq->num_added++;
+
+	pr_debug("%s: added buffer head %i to %p\n", __func__, head, vq);
+	END_USE(vq);
+
+	/*
+	 * This is very unlikely, but theoretically possible.  Kick
+	 * just in case.
+	 */
+	if (unlikely(vq->num_added == (1 << 16) - 1))
+		virtqueue_kick(_vq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_indirect_desc_table_add);
+
+/**
  * virtqueue_add_sgs - expose buffers to other end
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 7edfbdb..01dad22 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -34,6 +34,13 @@ struct virtqueue {
 	void *priv;
 };
 
+struct vring_desc *alloc_indirect(struct virtio_device *vdev,
+				  unsigned int num, gfp_t gfp);
+
+int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
+				      struct vring_desc *desc,
+				      unsigned int num);
+
 int virtqueue_add_outbuf(struct virtqueue *vq,
 			 struct scatterlist sg[], unsigned int num,
 			 void *data,
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..5ed3c7b 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_CHUNKS	3 /* Inflate/Deflate pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index c072959..0499fb8 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -111,6 +111,9 @@ struct vring {
 #define VRING_USED_ALIGN_SIZE 4
 #define VRING_DESC_ALIGN_SIZE 16
 
+/* The supported max queue size */
+#define VIRTQUEUE_MAX_SIZE 1024
+
 /* The standard layout for the ring is a continuous chunk of memory which looks
  * like this.  We assume num is a power of 2.
  *
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 175+ messages in thread

* [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-09 10:41 ` Wei Wang
  (?)
@ 2017-06-09 10:41   ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a function to find a page block on the free list specified by the
caller. Pages from the page block may be used immediately after the
function returns. The caller is responsible for detecting or preventing
the use of such pages.
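
For illustration only, a minimal sketch of how a caller might walk one free
list with this interface (not part of this patch; record_candidate_block()
is a hypothetical helper, and zone, order and migratetype are assumed to
have been chosen by the caller):

	struct page *page = NULL;	/* NULL asks for the first block */

	while (report_unused_page_block(zone, order, migratetype, &page) == 0) {
		/*
		 * "page" points to a block of 2^order pages that was free
		 * when the call returned; it may be reallocated at any time,
		 * so the caller must detect or prevent such use.
		 */
		record_candidate_block(page, order);
	}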

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 include/linux/mm.h |  5 +++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d22e69..82361a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1841,6 +1841,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+extern int report_unused_page_block(struct zone *zone, unsigned int order,
+				    unsigned int migratetype,
+				    struct page **page);
+#endif
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c25de4..0aefe02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4615,6 +4615,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+
+/*
+ * Heuristically get a page block in the system that is unused.
+ * It is possible that pages from the page block are used immediately after
+ * report_unused_page_block() returns. It is the caller's responsibility
+ * to either detect or prevent the use of such pages.
+ *
+ * The free list to check: zone->free_area[order].free_list[migratetype].
+ *
+ * If the caller supplied page block (i.e. **page) is on the free list, offer
+ * the next page block on the list to the caller. Otherwise, offer the first
+ * page block on the list.
+ *
+ * Return 0 when a page block is found on the caller-specified free list;
+ * return a positive value when no (further) page block is available; return
+ * -EINVAL when the caller passes invalid input.
+ */
+int report_unused_page_block(struct zone *zone, unsigned int order,
+			     unsigned int migratetype, struct page **page)
+{
+	struct zone *this_zone;
+	struct list_head *this_list;
+	int ret = 0;
+	unsigned long flags;
+
+	/* Sanity check */
+	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
+	    migratetype >= MIGRATE_TYPES)
+		return -EINVAL;
+
+	/* Zone validity check */
+	for_each_populated_zone(this_zone) {
+		if (zone == this_zone)
+			break;
+	}
+
+	/* Got a non-existent zone from the caller? */
+	if (zone != this_zone)
+		return -EINVAL;
+
+	spin_lock_irqsave(&this_zone->lock, flags);
+
+	this_list = &zone->free_area[order].free_list[migratetype];
+	if (list_empty(this_list)) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/* The caller is asking for the first free page block on the list */
+	if ((*page) == NULL) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller is not on this free list
+	 * anymore (e.g. a 1MB free page block has been split). In this case,
+	 * offer the first page block on the free list that the caller is
+	 * asking for.
+	 */
+	if (PageBuddy(*page) && order != page_order(*page)) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller is the last page block on
+	 * the list.
+	 */
+	if ((*page)->lru.next == this_list) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Finally, fall into the regular case: the page block passed from the
+	 * caller is still on the free list. Offer the next one.
+	 */
+	*page = list_next_entry((*page), lru);
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&this_zone->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(report_unused_page_block);
+
+#endif
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 175+ messages in thread

* [PATCH v11 5/6] mm: export symbol of next_zone and first_online_pgdat
  2017-06-09 10:41 ` Wei Wang
  (?)
@ 2017-06-09 10:41   ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

This patch enables for_each_zone()/for_each_populated_zone() to be
invoked by a kernel module.
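
As a minimal sketch (illustrative only, not part of this patch), module code
can now iterate over zones directly, since for_each_populated_zone() expands
to the first_online_pgdat()/next_zone() symbols exported here:

	#include <linux/kernel.h>
	#include <linux/mmzone.h>

	static void walk_populated_zones(void)
	{
		struct zone *zone;

		for_each_populated_zone(zone)
			pr_info("zone %s: %lu managed pages\n",
				zone->name, zone->managed_pages);
	}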

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 mm/mmzone.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmzone.c b/mm/mmzone.c
index a51c0a6..08a2a3a 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void)
 {
 	return NODE_DATA(first_online_node);
 }
+EXPORT_SYMBOL_GPL(first_online_pgdat);
 
 struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
 {
@@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone)
 	}
 	return zone;
 }
+EXPORT_SYMBOL_GPL(next_zone);
 
 static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 175+ messages in thread

* [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-09 10:41 ` Wei Wang
  (?)
@ 2017-06-09 10:41   ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-09 10:41 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new vq, cmdq, to handle requests between the device and driver.

This patch implements two commands sent from the device and handled in
the driver (the driver-side dispatch is sketched after the list below).
1) cmd VIRTIO_BALLOON_CMDQ_REPORT_STATS: this command is used to report
the guest memory statistics to the host. The stats_vq mechanism is not
used when the cmdq mechanism is enabled.
2) cmd VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES: this command is used to
report the guest unused pages to the host.
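
Condensed from the cmdq handling code added below (casts and error paths
omitted), the driver-side dispatch is roughly:

	while ((hdr = virtqueue_get_buf(vb->cmd_vq, &len)) != NULL) {
		switch (hdr->cmd) {
		case VIRTIO_BALLOON_CMDQ_REPORT_STATS:
			cmdq_handle_stats(vb);
			break;
		case VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES:
			cmdq_handle_unused_pages(vb);
			break;
		}
		/* hand both command headers back to the device */
		host_cmd_buf_add(vb, &vb->cmdq_stats.hdr);
		host_cmd_buf_add(vb, &vb->cmdq_unused_page.hdr);
	}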

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 363 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_balloon.h |  13 ++
 2 files changed, 337 insertions(+), 39 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0cf945c..4ac90a5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -51,6 +51,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* Types of pages to chunk */
+#define PAGE_CHUNK_TYPE_BALLOON 0	/* Chunk of inflate/deflate pages */
+#define PAGE_CHUNK_UNUSED_PAGE  1	/* Chunk of unused pages */
+
 /* The size of one page_bmap used to record inflated/deflated pages. */
 #define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
 /*
@@ -81,12 +85,25 @@ struct virtio_balloon_page_chunk {
 	unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
 };
 
+struct virtio_balloon_cmdq_unused_page {
+	struct virtio_balloon_cmdq_hdr hdr;
+	struct vring_desc *desc_table;
+	/* Number of added descriptors */
+	unsigned int num;
+};
+
+struct virtio_balloon_cmdq_stats {
+	struct virtio_balloon_cmdq_hdr hdr;
+	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *cmd_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
+	struct work_struct cmdq_handle_work;
 	struct work_struct update_balloon_size_work;
 
 	/* Prevent updating balloon when it is being canceled. */
@@ -115,8 +132,10 @@ struct virtio_balloon {
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
 
-	/* Memory statistics */
-	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+	/* Cmdq msg buffer for memory statistics */
+	struct virtio_balloon_cmdq_stats cmdq_stats;
+	/* Cmdq msg buffer for reporting unused pages */
+	struct virtio_balloon_cmdq_unused_page cmdq_unused_page;
 
 	/* To register callback in oom notifier call chain */
 	struct notifier_block nb;
@@ -208,31 +227,77 @@ static void clear_page_bmap(struct virtio_balloon *vb,
 		       VIRTIO_BALLOON_PAGE_BMAP_SIZE);
 }
 
-static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq)
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
+			     int type, bool busy_wait)
 {
-	unsigned int len, num;
-	struct vring_desc *desc = vb->balloon_page_chunk.desc_table;
+	unsigned int len, *num, reset_num;
+	struct vring_desc *desc;
+
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		desc = vb->balloon_page_chunk.desc_table;
+		num = &vb->balloon_page_chunk.chunk_num;
+		reset_num = 0;
+		break;
+	case PAGE_CHUNK_UNUSED_PAGE:
+		desc = vb->cmdq_unused_page.desc_table;
+		num = &vb->cmdq_unused_page.num;
+		/*
+		 * The first desc is used for the cmdq_hdr, so chunks will be
+		 * added from the second desc.
+		 */
+		reset_num = 1;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: unknown page chunk type %d\n",
+			 __func__, type);
+		return;
+	}
 
-	num = vb->balloon_page_chunk.chunk_num;
-	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
+	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
 		virtqueue_kick(vq);
-		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
-		vb->balloon_page_chunk.chunk_num = 0;
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len) &&
+			       !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		/*
+		 * The descriptors have now been delivered to the host. Reset
+		 * the counter of added descriptors so that newly added
+		 * descriptors are counted from the start again.
+		 */
+		*num = reset_num;
 	}
 }
 
 /* Add a chunk to the buffer. */
 static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
-			  u64 base_addr, u32 size)
+			  int type, u64 base_addr, u32 size)
 {
-	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
-	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
+	unsigned int *num;
+	struct vring_desc *desc;
+
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		num = &vb->balloon_page_chunk.chunk_num;
+		desc = &vb->balloon_page_chunk.desc_table[*num];
+		break;
+	case PAGE_CHUNK_UNUSED_PAGE:
+		num = &vb->cmdq_unused_page.num;
+		desc = &vb->cmdq_unused_page.desc_table[*num];
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
 
 	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
 	desc->len = cpu_to_virtio32(vb->vdev, size);
 	*num += 1;
 	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
-		send_page_chunks(vb, vq);
+		send_page_chunks(vb, vq, type, false);
 }
 
 static void convert_bmap_to_chunks(struct virtio_balloon *vb,
@@ -264,7 +329,8 @@ static void convert_bmap_to_chunks(struct virtio_balloon *vb,
 		chunk_base_addr = (pfn_start + next_one) <<
 				  VIRTIO_BALLOON_PFN_SHIFT;
 		if (chunk_size) {
-			add_one_chunk(vb, vq, chunk_base_addr, chunk_size);
+			add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+				      chunk_base_addr, chunk_size);
 			pos += next_zero + 1;
 		}
 	}
@@ -311,7 +377,7 @@ static void tell_host_from_page_bmap(struct virtio_balloon *vb,
 				       pfn_num);
 	}
 	if (vb->balloon_page_chunk.chunk_num > 0)
-		send_page_chunks(vb, vq);
+		send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON, false);
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -516,8 +582,8 @@ static inline void update_stat(struct virtio_balloon *vb, int idx,
 			       u16 tag, u64 val)
 {
 	BUG_ON(idx >= VIRTIO_BALLOON_S_NR);
-	vb->stats[idx].tag = cpu_to_virtio16(vb->vdev, tag);
-	vb->stats[idx].val = cpu_to_virtio64(vb->vdev, val);
+	vb->cmdq_stats.stats[idx].tag = cpu_to_virtio16(vb->vdev, tag);
+	vb->cmdq_stats.stats[idx].val = cpu_to_virtio64(vb->vdev, val);
 }
 
 #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
@@ -582,7 +648,8 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	vq = vb->stats_vq;
 	if (!virtqueue_get_buf(vq, &len))
 		return;
-	sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+	sg_init_one(&sg, vb->cmdq_stats.stats,
+		    sizeof(vb->cmdq_stats.stats[0]) * num_stats);
 	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
 	virtqueue_kick(vq);
 }
@@ -686,43 +753,216 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void cmdq_handle_stats(struct virtio_balloon *vb)
+{
+	struct scatterlist sg;
+	unsigned int num_stats;
+
+	spin_lock(&vb->stop_update_lock);
+	if (!vb->stop_update) {
+		num_stats = update_balloon_stats(vb);
+		sg_init_one(&sg, &vb->cmdq_stats,
+			    sizeof(struct virtio_balloon_cmdq_hdr) +
+			    sizeof(struct virtio_balloon_stat) * num_stats);
+		virtqueue_add_outbuf(vb->cmd_vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vb->cmd_vq);
+	}
+	spin_unlock(&vb->stop_update_lock);
+}
+
+/*
+ * The header part of the message buffer is given to the device to send a
+ * command to the driver.
+ */
+static void host_cmd_buf_add(struct virtio_balloon *vb,
+			   struct virtio_balloon_cmdq_hdr *hdr)
+{
+	struct scatterlist sg;
+
+	hdr->flags = 0;
+	sg_init_one(&sg, hdr, VIRTIO_BALLOON_CMDQ_HDR_SIZE);
+
+	if (virtqueue_add_inbuf(vb->cmd_vq, &sg, 1, hdr, GFP_KERNEL) < 0) {
+		__virtio_clear_bit(vb->vdev,
+				   VIRTIO_BALLOON_F_CMD_VQ);
+		dev_warn(&vb->vdev->dev, "%s: add miscq msg buf err\n",
+			 __func__);
+		return;
+	}
+
+	virtqueue_kick(vb->cmd_vq);
+}
+
+static void cmdq_handle_unused_pages(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq = vb->cmd_vq;
+	struct vring_desc *hdr_desc = &vb->cmdq_unused_page.desc_table[0];
+	unsigned long hdr_pa;
+	unsigned int order = 0, migratetype = 0;
+	struct zone *zone = NULL;
+	struct page *page = NULL;
+	u64 pfn;
+	int ret = 0;
+
+	/* Put the hdr to the first desc */
+	hdr_pa = virt_to_phys((void *)&vb->cmdq_unused_page.hdr);
+	hdr_desc->addr = cpu_to_virtio64(vb->vdev, hdr_pa);
+	hdr_desc->len = cpu_to_virtio32(vb->vdev,
+				sizeof(struct virtio_balloon_cmdq_hdr));
+	vb->cmdq_unused_page.num = 1;
+
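+	/*
+	 * Walk each zone's free lists and report every free block found as
+	 * one chunk.
+	 */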
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order > 0; order--) {
+			for (migratetype = 0; migratetype < MIGRATE_TYPES;
+			     migratetype++) {
+				do {
+					ret = report_unused_page_block(zone,
+						order, migratetype, &page);
+					if (!ret) {
+						pfn = (u64)page_to_pfn(page);
+						add_one_chunk(vb, vq,
+						PAGE_CHUNK_UNUSED_PAGE,
+						pfn << VIRTIO_BALLOON_PFN_SHIFT,
+						(u64)(1 << order) *
+						VIRTIO_BALLOON_PAGES_PER_PAGE);
+					}
+				} while (!ret);
+			}
+		}
+	}
+
+	/* Set the cmd completion flag. */
+	vb->cmdq_unused_page.hdr.flags |=
+				cpu_to_le32(VIRTIO_BALLOON_CMDQ_F_COMPLETION);
+	send_page_chunks(vb, vq, PAGE_CHUNK_UNUSED_PAGE, true);
+}
+
+static void cmdq_handle(struct virtio_balloon *vb)
+{
+	struct virtqueue *vq;
+	struct virtio_balloon_cmdq_hdr *hdr;
+	unsigned int len;
+
+	vq = vb->cmd_vq;
+	while ((hdr = (struct virtio_balloon_cmdq_hdr *)
+			virtqueue_get_buf(vq, &len)) != NULL) {
+		switch (hdr->cmd) {
+		case VIRTIO_BALLOON_CMDQ_REPORT_STATS:
+			cmdq_handle_stats(vb);
+			break;
+		case VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES:
+			cmdq_handle_unused_pages(vb);
+			break;
+		default:
+			dev_warn(&vb->vdev->dev, "%s: wrong cmd\n", __func__);
+			return;
+		}
+		/*
+		 * Replenish all the command buffers to the device after a
+		 * command is handled. This makes it easy for the device to
+		 * rewind the cmdq and reclaim all the command buffers after
+		 * live migration.
+		 */
+		host_cmd_buf_add(vb, &vb->cmdq_stats.hdr);
+		host_cmd_buf_add(vb, &vb->cmdq_unused_page.hdr);
+	}
+}
+
+static void cmdq_handle_work_func(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+
+	vb = container_of(work, struct virtio_balloon,
+			  cmdq_handle_work);
+	cmdq_handle(vb);
+}
+
+static void cmdq_callback(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	queue_work(system_freezable_wq, &vb->cmdq_handle_work);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	int err = -ENOMEM;
+	int nvqs;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ) ||
+	    virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
 
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * The stats_vq is used only when cmdq is not supported (or disabled)
+	 * by the device.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
-			NULL);
-	if (err)
-		return err;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
+		callbacks[2] = cmdq_callback;
+		names[2] = "cmdq";
+	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[2] = stats_request;
+		names[2] = "stats";
+	}
 
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
+					 names, NULL);
+	if (err)
+		goto err_find;
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
-	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
+		vb->cmd_vq = vqs[2];
+		/* Prime the cmdq with the header buffer. */
+		host_cmd_buf_add(vb, &vb->cmdq_stats.hdr);
+		host_cmd_buf_add(vb, &vb->cmdq_unused_page.hdr);
+	} else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
 
+		vb->stats_vq = vqs[2];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+		sg_init_one(&sg, vb->cmdq_stats.stats,
+			    sizeof(vb->cmdq_stats.stats));
 		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
 		    < 0)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
-	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -730,7 +970,8 @@ static int init_vqs(struct virtio_balloon *vb)
 static void tell_host_one_page(struct virtio_balloon *vb,
 			       struct virtqueue *vq, struct page *page)
 {
-	add_one_chunk(vb, vq, page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
+	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+		      page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
 		      VIRTIO_BALLOON_PAGES_PER_PAGE);
 }
 
@@ -865,6 +1106,40 @@ static int balloon_page_chunk_init(struct virtio_balloon *vb)
 	return -ENOMEM;
 }
 
+/*
+ * Only one command of each type is in flight at a time, so we allocate one
+ * message buffer per command type. The header part of each message buffer is
+ * offered to the device, so that the device can later use the corresponding
+ * buffer to send a command to the driver.
+ */
+static int cmdq_init(struct virtio_balloon *vb)
+{
+	vb->cmdq_unused_page.desc_table = alloc_indirect(vb->vdev,
+						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
+						GFP_KERNEL);
+	if (!vb->cmdq_unused_page.desc_table) {
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+		__virtio_clear_bit(vb->vdev,
+				   VIRTIO_BALLOON_F_CMD_VQ);
+		return -ENOMEM;
+	}
+	vb->cmdq_unused_page.num = 0;
+
+	/*
+	 * The header is initialized to let the device know which type of
+	 * command buffer it receives. The device will later use a buffer
+	 * according to the type of command that it needs to send.
+	 */
+	vb->cmdq_stats.hdr.cmd = VIRTIO_BALLOON_CMDQ_REPORT_STATS;
+	vb->cmdq_stats.hdr.flags = 0;
+	vb->cmdq_unused_page.hdr.cmd = VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES;
+	vb->cmdq_unused_page.hdr.flags = 0;
+
+	INIT_WORK(&vb->cmdq_handle_work, cmdq_handle_work_func);
+
+	return 0;
+}
+
 static int virtballoon_validate(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb = NULL;
@@ -883,6 +1158,11 @@ static int virtballoon_validate(struct virtio_device *vdev)
 			goto err_page_chunk;
 	}
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_CMD_VQ)) {
+		err = cmdq_init(vb);
+		if (err < 0)
+			goto err_vb;
+	}
 	return 0;
 
 err_page_chunk:
@@ -902,7 +1182,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		return -EINVAL;
 	}
 
-	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
+	if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_CMD_VQ) &&
+	    virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		INIT_WORK(&vb->update_balloon_stats_work,
+			  update_balloon_stats_func);
 	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
@@ -980,6 +1263,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
+	cancel_work_sync(&vb->cmdq_handle_work);
 
 	remove_common(vb);
 	free_page_bmap(vb);
@@ -1029,6 +1313,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_PAGE_CHUNKS,
+	VIRTIO_BALLOON_F_CMD_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 5ed3c7b..cb66c1a 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_CHUNKS	3 /* Inflate/Deflate pages in chunks */
+#define VIRTIO_BALLOON_F_CMD_VQ		4 /* Command virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -83,4 +84,16 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+/* Use the memory of a vring_desc to place the cmdq header */
+#define VIRTIO_BALLOON_CMDQ_HDR_SIZE sizeof(struct vring_desc)
+
+struct virtio_balloon_cmdq_hdr {
+#define VIRTIO_BALLOON_CMDQ_REPORT_STATS	0
+#define VIRTIO_BALLOON_CMDQ_REPORT_UNUSED_PAGES	1
+	__le32 cmd;
+/* Flag to indicate the completion of handling a command */
+#define VIRTIO_BALLOON_CMDQ_F_COMPLETION	1
+	__le32 flags;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 175+ messages in thread

* RE: [PATCH v11 0/6] Virtio-balloon Enhancement
  2017-06-09 10:41 ` Wei Wang
@ 2017-06-09 11:18   ` Wang, Wei W
  -1 siblings, 0 replies; 175+ messages in thread
From: Wang, Wei W @ 2017-06-09 11:18 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Friday, June 9, 2017 6:42 PM, Wang, Wei W wrote:
> To: virtio-dev@lists.oasis-open.org; linux-kernel@vger.kernel.org; qemu-
> devel@nongnu.org; virtualization@lists.linux-foundation.org;
> kvm@vger.kernel.org; linux-mm@kvack.org; mst@redhat.com;
> david@redhat.com; Hansen, Dave <dave.hansen@intel.com>;
> cornelia.huck@de.ibm.com; akpm@linux-foundation.org;
> mgorman@techsingularity.net; aarcange@redhat.com; amit.shah@redhat.com;
> pbonzini@redhat.com; Wang, Wei W <wei.w.wang@intel.com>;
> liliang.opensource@gmail.com
> Subject: [PATCH v11 0/6] Virtio-balloon Enhancement
> 
> This patch series enhances the existing virtio-balloon with the following new
> features:
> 1) fast ballooning: transfer ballooned pages between the guest and host in
> chunks, instead of one by one; and
> 2) cmdq: a new virtqueue to send commands between the device and driver.
> Currently, it supports commands to report memory stats (replace the old statq
> mechanism) and report guest unused pages.

v10->v11 changes:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3) virtio_balloon: use cmdq to report guest memory statistics.

> 
> Liang Li (1):
>   virtio-balloon: deflate via a page list
> 
> Wei Wang (5):
>   virtio-balloon: coding format cleanup
>   virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
>   mm: function to offer a page block on the free list
>   mm: export symbol of next_zone and first_online_pgdat
>   virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
> 
>  drivers/virtio/virtio_balloon.c     | 781 ++++++++++++++++++++++++++++++++--
> --
>  drivers/virtio/virtio_ring.c        | 120 +++++-
>  include/linux/mm.h                  |   5 +
>  include/linux/virtio.h              |   7 +
>  include/uapi/linux/virtio_balloon.h |  14 +
>  include/uapi/linux/virtio_ring.h    |   3 +
>  mm/mmzone.c                         |   2 +
>  mm/page_alloc.c                     |  91 +++++
>  8 files changed, 950 insertions(+), 73 deletions(-)
> 
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-09 10:41   ` Wei Wang
@ 2017-06-12 14:07     ` Dave Hansen
  -1 siblings, 0 replies; 175+ messages in thread
From: Dave Hansen @ 2017-06-12 14:07 UTC (permalink / raw)
  To: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, david, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 06/09/2017 03:41 AM, Wei Wang wrote:
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1; order > 0; order--) {
> +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
> +			     migratetype++) {
> +				do {
> +					ret = report_unused_page_block(zone,
> +						order, migratetype, &page);
> +					if (!ret) {
> +						pfn = (u64)page_to_pfn(page);
> +						add_one_chunk(vb, vq,
> +						PAGE_CHNUK_UNUSED_PAGE,
> +						pfn << VIRTIO_BALLOON_PFN_SHIFT,
> +						(u64)(1 << order) *
> +						VIRTIO_BALLOON_PAGES_PER_PAGE);
> +					}
> +				} while (!ret);
> +			}
> +		}
> +	}

This is pretty unreadable.    Please add some indentation.  If you go
over 80 cols, then you might need to break this up into a separate
function.  But, either way, it can't be left like this.
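A minimal sketch of one way to split this up, purely for illustration: the
helper names below (report_unused_pages_of_type, report_unused_pages) are
made up, the other identifiers are taken from the quoted hunk, and the
chunk-type constant is written as PAGE_CHUNK_UNUSED_PAGE on the assumption
that PAGE_CHNUK_UNUSED_PAGE above is a typo.

static void report_unused_pages_of_type(struct virtio_balloon *vb,
					struct virtqueue *vq,
					struct zone *zone,
					unsigned int order,
					unsigned int migratetype)
{
	struct page *page;
	u64 pfn;

	/* Keep reporting blocks from this free list until none is returned. */
	while (!report_unused_page_block(zone, order, migratetype, &page)) {
		pfn = (u64)page_to_pfn(page);
		add_one_chunk(vb, vq, PAGE_CHUNK_UNUSED_PAGE,
			      pfn << VIRTIO_BALLOON_PFN_SHIFT,
			      (u64)(1 << order) *
			      VIRTIO_BALLOON_PAGES_PER_PAGE);
	}
}

static void report_unused_pages(struct virtio_balloon *vb,
				struct virtqueue *vq)
{
	struct zone *zone;
	unsigned int order, migratetype;

	for_each_populated_zone(zone)
		for (order = MAX_ORDER - 1; order > 0; order--)
			for (migratetype = 0; migratetype < MIGRATE_TYPES;
			     migratetype++)
				report_unused_pages_of_type(vb, vq, zone,
							    order,
							    migratetype);
}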

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-09 10:41   ` Wei Wang
@ 2017-06-12 14:10     ` Dave Hansen
  -1 siblings, 0 replies; 175+ messages in thread
From: Dave Hansen @ 2017-06-12 14:10 UTC (permalink / raw)
  To: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource

Please stop cc'ing me on things also sent to closed mailing lists
(virtio-dev@lists.oasis-open.org).  I'm happy to review things on open
lists, but I'm not fond of the closed lists bouncing things at me.

On 06/09/2017 03:41 AM, Wei Wang wrote:
> Add a function to find a page block on the free list specified by the
> caller. Pages from the page block may be used immediately after the
> function returns. The caller is responsible for detecting or preventing
> the use of such pages.

This description doesn't tell me very much about what's going on here.
Neither does the comment.

"Pages from the page block may be used immediately after the
 function returns".

Used by who?  Does the "may" here mean that it is OK, or is it a warning
that the contents will be thrown away immediately?

The hypervisor is going to throw away the contents of these pages,
right?  As soon as the spinlock is released, someone can allocate a
page, and put good data in it.  What keeps the hypervisor from throwing
away good data?

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-12 14:10     ` Dave Hansen
@ 2017-06-12 16:28       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-12 16:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource

On Mon, Jun 12, 2017 at 07:10:12AM -0700, Dave Hansen wrote:
> Please stop cc'ing me on things also sent to closed mailing lists
> (virtio-dev@lists.oasis-open.org).  I'm happy to review things on open
> lists, but I'm not fond of the closed lists bouncing things at me.
> 
> On 06/09/2017 03:41 AM, Wei Wang wrote:
> > Add a function to find a page block on the free list specified by the
> > caller. Pages from the page block may be used immediately after the
> > function returns. The caller is responsible for detecting or preventing
> > the use of such pages.
> 
> This description doesn't tell me very much about what's going on here.
> Neither does the comment.
> 
> "Pages from the page block may be used immediately after the
>  function returns".
> 
> Used by who?  Does the "may" here mean that it is OK, or is it a warning
> that the contents will be thrown away immediately?

I agree here. Don't tell callers what they should do; say what the
function does. "offer" is also confusing. Here's a better comment:

--->
mm: support reporting free page blocks

This adds support for reporting blocks of pages on the free list
specified by the caller.

As pages can leave the free list during this call or immediately
afterwards, they are not guaranteed to be free after the function
returns. The only guarantee this makes is that the page was on the free
list at some point in time after the function has been invoked.

Therefore, it is not safe for caller to use any pages on the returned
block or to discard data that is put there after the function returns.
However, it is safe for caller to discard data that was in one of these
pages before the function was invoked.

---

And repeat the last part in a code comment:

 * Note: it is not safe for caller to use any pages on the returned
 * block or to discard data that is put there after the function returns.
 * However, it is safe for caller to discard data that was in one of these
 * pages before the function was invoked.
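
As an illustration only (the prototype is inferred from the call site quoted
earlier in the thread and may not match the posted patch exactly), the
wording could sit above the helper like this:

/**
 * report_unused_page_block - report a block of pages on the requested free list
 * @zone: the zone to scan
 * @order: the order of the free list to scan
 * @migratetype: the migratetype of the free list to scan
 * @page: on success, set to a page from the reported free page block
 *
 * Note: it is not safe for caller to use any pages on the returned
 * block or to discard data that is put there after the function returns.
 * However, it is safe for caller to discard data that was in one of these
 * pages before the function was invoked.
 *
 * Returns 0 when a block is reported, non-zero when no further block is
 * available on the given free list (inferred from the quoted call site).
 */
int report_unused_page_block(struct zone *zone, unsigned int order,
			     unsigned int migratetype, struct page **page);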


> The hypervisor is going to throw away the contents of these pages,
> right?

It should be careful and only throw away contents that was there before
report_unused_page_block was invoked.  Hypervisor is responsible for not
corrupting guest memory.  But that's not something an mm patch should
worry about.

>  As soon as the spinlock is released, someone can allocate a
> page, and put good data in it.  What keeps the hypervisor from throwing
> away good data?

API should require this explicitly. Hopefully above answers this question.

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-12 16:28       ` Michael S. Tsirkin
@ 2017-06-12 16:42         ` Dave Hansen
  -1 siblings, 0 replies; 175+ messages in thread
From: Dave Hansen @ 2017-06-12 16:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource

On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
> 
>> The hypervisor is going to throw away the contents of these pages,
>> right?
> It should be careful and only throw away contents that was there before
> report_unused_page_block was invoked.  Hypervisor is responsible for not
> corrupting guest memory.  But that's not something an mm patch should
> worry about.

That makes sense.  I'm struggling to imagine how the hypervisor makes
use of this information, though.  Does it make the pages read-only
before this, and then it knows if there has not been a write *and* it
gets notified via this new mechanism that it can throw the page away?

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-12 16:42         ` Dave Hansen
@ 2017-06-12 20:34           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-12 20:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource

On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
> > 
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?
> > It should be careful and only throw away contents that was there before
> > report_unused_page_block was invoked.  Hypervisor is responsible for not
> > corrupting guest memory.  But that's not something an mm patch should
> > worry about.
> 
> That makes sense.  I'm struggling to imagine how the hypervisor makes
> use of this information, though.  Does it make the pages read-only
> before this, and then it knows if there has not been a write *and* it
> gets notified via this new mechanism that it can throw the page away?

Yes, and specifically, this is how it works for migration.  Normally you
start by migrating all of memory, then send updates incrementally if
pages have been modified.  This mechanism allows skipping some pages in
the 1st stage, if they get changed they will be migrated in the 2nd
stage.
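
To make the skip decision concrete, here is a toy model in C (illustrative
only, with made-up names - this is not QEMU code): a page is skipped in the
first pass only if the guest hinted it as free and dirty logging shows it
has not been written since.

#include <stdbool.h>
#include <stddef.h>

/* Illustrative per-page state the hypervisor is assumed to track. */
struct mig_state {
	const bool *hinted_free;  /* guest reported this page as unused      */
	const bool *dirty;        /* written since dirty logging was enabled */
	size_t nr_pages;
};

/* First pass: send everything except hinted-free pages that stayed clean. */
static bool send_in_first_pass(const struct mig_state *s, size_t pfn)
{
	return !(s->hinted_free[pfn] && !s->dirty[pfn]);
}

Pages skipped here but written afterwards show up in the dirty log and are
sent in the second, incremental stage as usual.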

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-12 20:34           ` Michael S. Tsirkin
@ 2017-06-12 20:54             ` Dave Hansen
  -1 siblings, 0 replies; 175+ messages in thread
From: Dave Hansen @ 2017-06-12 20:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource

On 06/12/2017 01:34 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
>> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?
>>> It should be careful and only throw away contents that was there before
>>> report_unused_page_block was invoked.  Hypervisor is responsible for not
>>> corrupting guest memory.  But that's not something an mm patch should
>>> worry about.
>>
>> That makes sense.  I'm struggling to imagine how the hypervisor makes
>> use of this information, though.  Does it make the pages read-only
>> before this, and then it knows if there has not been a write *and* it
>> gets notified via this new mechanism that it can throw the page away?
> 
> Yes, and specifically, this is how it works for migration.  Normally you
> start by migrating all of memory, then send updates incrementally if
> pages have been modified.  This mechanism allows skipping some pages in
> the 1st stage, if they get changed they will be migrated in the 2nd
> stage.

OK, so the migration starts and marks everything read-only.  All the
pages now have read-only valuable data, or read-only worthless data in
the case that the page is in the free lists.  In order for a page to
become non-worthless, it has to have a write done to it, which the
hypervisor obviously knows about.

With this mechanism, the hypervisor knows it can discard pages which
have not had a write since they were known to have worthless contents.

Correct?

That also seems like pretty good information to include in the
changelog.  Otherwise, folks are going to be left wondering what good
the mechanism is.  It's pretty non-trivial to figure out. :)

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-12 20:54             ` Dave Hansen
@ 2017-06-13  2:56               ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-13  2:56 UTC (permalink / raw)
  To: Dave Hansen, Michael S. Tsirkin
  Cc: linux-kernel, qemu-devel, virtualization, kvm, linux-mm, david,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource

On 06/13/2017 04:54 AM, Dave Hansen wrote:
> On 06/12/2017 01:34 PM, Michael S. Tsirkin wrote:
>> On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
>>> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
>>>>> The hypervisor is going to throw away the contents of these pages,
>>>>> right?
>>>> It should be careful and only throw away contents that was there before
>>>> report_unused_page_block was invoked.  Hypervisor is responsible for not
>>>> corrupting guest memory.  But that's not something an mm patch should
>>>> worry about.
>>> That makes sense.  I'm struggling to imagine how the hypervisor makes
>>> use of this information, though.  Does it make the pages read-only
>>> before this, and then it knows if there has not been a write *and* it
>>> gets notified via this new mechanism that it can throw the page away?
>> Yes, and specifically, this is how it works for migration.  Normally you
>> start by migrating all of memory, then send updates incrementally if
>> pages have been modified.  This mechanism allows skipping some pages in
>> the 1st stage, if they get changed they will be migrated in the 2nd
>> stage.
> OK, so the migration starts and marks everything read-only.  All the
> pages now have read-only valuable data, or read-only worthless data in
> the case that the page is in the free lists.  In order for a page to
> become non-worthless, it has to have a write done to it, which the
> hypervisor obviously knows about.
>
> With this mechanism, the hypervisor knows it can discard pages which
> have not had a write since they were known to have worthless contents.
>
> Correct?
Right. By the way, marking pages read-only is one of the dirty page
logging methods that a hypervisor uses to capture the pages that are
written by the VM.

>
> That also seems like pretty good information to include in the
> changelog.  Otherwise, folks are going to be left wondering what good
> the mechanism is.  It's pretty non-trivial to figure out. :)
If necessary, I think it's better to keep the introduction at a high level:

Example of how a hypervisor can use this API:
To live migrate a VM from one physical machine to another,
the hypervisor usually transfers all of the VM's memory content.
An optimization is to skip the transfer of the memory that is not
in use by the VM, because the content of unused memory is worthless.
This API is then used to report the unused pages to the hypervisor.
Pages that have been reported as unused may be used by the VM after
the report. The hypervisor has a mechanism (i.e. dirty page logging)
to capture that change, so if such a page is later written with new
data, the hypervisor will still transfer it to the destination machine.

What do you guys think?

Best,
Wei

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-12 14:07     ` Dave Hansen
@ 2017-06-13 10:17       ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-13 10:17 UTC (permalink / raw)
  To: Dave Hansen, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, david, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource

On 06/12/2017 10:07 PM, Dave Hansen wrote:
> On 06/09/2017 03:41 AM, Wei Wang wrote:
>> +	for_each_populated_zone(zone) {
>> +		for (order = MAX_ORDER - 1; order > 0; order--) {
>> +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
>> +			     migratetype++) {
>> +				do {
>> +					ret = report_unused_page_block(zone,
>> +						order, migratetype, &page);
>> +					if (!ret) {
>> +						pfn = (u64)page_to_pfn(page);
>> +						add_one_chunk(vb, vq,
>> +						PAGE_CHNUK_UNUSED_PAGE,
>> +						pfn << VIRTIO_BALLOON_PFN_SHIFT,
>> +						(u64)(1 << order) *
>> +						VIRTIO_BALLOON_PAGES_PER_PAGE);
>> +					}
>> +				} while (!ret);
>> +			}
>> +		}
>> +	}
> This is pretty unreadable.    Please add some indentation.  If you go
> over 80 cols, then you might need to break this up into a separate
> function.  But, either way, it can't be left like this.

OK, I'll re-arrange it.
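
For instance (a sketch only; the helper name below is made up), the per-zone
body could move into its own function to keep the indentation sane:

	static void report_unused_pages_of_zone(struct virtio_balloon *vb,
						struct virtqueue *vq,
						struct zone *zone)
	{
		unsigned int order, migratetype;
		struct page *page;
		u64 pfn;

		for (order = MAX_ORDER - 1; order > 0; order--) {
			for (migratetype = 0; migratetype < MIGRATE_TYPES;
			     migratetype++) {
				while (!report_unused_page_block(zone, order,
								 migratetype,
								 &page)) {
					pfn = (u64)page_to_pfn(page);
					add_one_chunk(vb, vq,
						PAGE_CHNUK_UNUSED_PAGE,
						pfn << VIRTIO_BALLOON_PFN_SHIFT,
						(u64)(1 << order) *
						VIRTIO_BALLOON_PAGES_PER_PAGE);
				}
			}
		}
	}

so the caller becomes just:

	for_each_populated_zone(zone)
		report_unused_pages_of_zone(vb, vq, zone);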

Best,
Wei

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-09 10:41   ` Wei Wang
@ 2017-06-13 17:56     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-13 17:56 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Matthew Wilcox

On Fri, Jun 09, 2017 at 06:41:38PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
> the transfer of the ballooned (i.e. inflated/deflated) pages in
> chunks to the host.

So now these chunks are just s/g list entries.
Let's rename this to VIRTIO_BALLOON_F_SG with a comment:
* Use standard virtio s/g instead of PFN lists *
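
I.e. (keeping the bit value this patch already uses) something like:

	#define VIRTIO_BALLOON_F_SG	3 /* Use standard virtio s/g instead of PFN lists */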

> The implementation of the previous virtio-balloon is not very
> efficient, because the ballooned pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically continuous pages.
> When the pages are packed into a chunk, they are converted into
> balloon page size (4KB) pages. A chunk is offered to the host
> via a base address (i.e. the start guest physical address of those
> physically continuous pages) and the size (i.e. the total number
> of the 4KB balloon size pages). A chunk is described via a
> vring_desc struct in the implementation.
> 
> By doing so, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~590ms
> resulting in an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 418 +++++++++++++++++++++++++++++++++---
>  drivers/virtio/virtio_ring.c        | 120 ++++++++++-
>  include/linux/virtio.h              |   7 +
>  include/uapi/linux/virtio_balloon.h |   1 +
>  include/uapi/linux/virtio_ring.h    |   3 +
>  5 files changed, 517 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index ecb64e9..0cf945c 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -51,6 +51,36 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +/* The size of one page_bmap used to record inflated/deflated pages. */
> +#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)

At this size, you probably want alloc_pages to avoid kmalloc
overhead.
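
E.g. (sketch), since VIRTIO_BALLOON_PAGE_BMAP_SIZE is a multiple of PAGE_SIZE:

	vb->balloon_page_chunk.page_bmap[i] =
		(unsigned long *)__get_free_pages(GFP_KERNEL,
				get_order(VIRTIO_BALLOON_PAGE_BMAP_SIZE));

with a matching free_pages(..., get_order(VIRTIO_BALLOON_PAGE_BMAP_SIZE))
on the free path.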

> +/*
> + * Callulates how many pfns can a page_bmap record. A bit corresponds to a
> + * page of PAGE_SIZE.
> + */
> +#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
> +	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +
> +/* The number of page_bmap to allocate by default. */
> +#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1

It's not by default, it's at probe time, right?

> +/* The maximum number of page_bmap that can be allocated. */

Not really, this is the size of the array we use to keep them.

> +#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
> +

So you still have a home-grown bitmap. I'd like to know why the
xbitmap that Matthew Wilcox suggested for this purpose isn't
appropriate. Please add a comment explaining the requirements
on the data structure.

> +/*
> + * QEMU virtio implementation requires the desc table size less than
> + * VIRTQUEUE_MAX_SIZE, so minus 1 here.

I think it doesn't, the issue is probably that you add a header
as a separate s/g. In any case see below.

> + */
> +#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)

This is wrong, virtio spec says s/g size should not exceed VQ size.
If you want to support huge VQ sizes, you can add a fallback to
smaller sizes until it fits in 1 page.
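
E.g. (sketch), clamp against both limits at run time:

	max_chunks = min_t(unsigned int, virtqueue_get_vring_size(vq),
			   PAGE_SIZE / sizeof(struct vring_desc));

instead of deriving the limit from a hard-coded VIRTQUEUE_MAX_SIZE.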

> +
> +/* The struct to manage ballooned pages in chunks */
> +struct virtio_balloon_page_chunk {
> +	/* Indirect desc table to hold chunks of balloon pages */
> +	struct vring_desc *desc_table;
> +	/* Number of added chunks of balloon pages */
> +	unsigned int chunk_num;
> +	/* Bitmap used to record ballooned pages. */
> +	unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
> +};
> +
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -79,6 +109,8 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	struct virtio_balloon_page_chunk balloon_page_chunk;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -111,6 +143,133 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> +/* Update pfn_max and pfn_min according to the pfn of page */
> +static inline void update_pfn_range(struct virtio_balloon *vb,
> +				    struct page *page,
> +				    unsigned long *pfn_min,
> +				    unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +}
> +
> +static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
> +					  unsigned long pfn_num)

What's this API doing? Pls add comments. This seems to assume
it will only be called once. It would be better to avoid making
that assumption: just look at what has been allocated
and extend it.
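
E.g. (sketch), start from whatever is already there instead of assuming
only the default number of page_bmaps exists:

	/* count the page_bmaps that are already allocated */
	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM; i++)
		if (!vb->balloon_page_chunk.page_bmap[i])
			break;
	allocated_bmap_num = i;

	for (; i < bmap_num; i++) {
		vb->balloon_page_chunk.page_bmap[i] =
			kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
		if (!vb->balloon_page_chunk.page_bmap[i])
			break;
		allocated_bmap_num++;
	}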

> +{
> +	unsigned int i, bmap_num, allocated_bmap_num;
> +	unsigned long bmap_len;
> +
> +	allocated_bmap_num = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM;

how come? Pls init vars where they are declared.

> +	bmap_len = ALIGN(pfn_num, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bmap_len = roundup(bmap_len, VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +	/*
> +	 * VIRTIO_BALLOON_PAGE_BMAP_SIZE is the size of one page_bmap, so
> +	 * divide it to calculate how many page_bmap that we need.
> +	 */
> +	bmap_num = (unsigned int)(bmap_len / VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +	/* The number of page_bmap to allocate should not exceed the max */
> +	bmap_num = min_t(unsigned int, VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM,
> +			 bmap_num);

two comments above don't really help just drop them.

> +
> +	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < bmap_num; i++) {
> +		vb->balloon_page_chunk.page_bmap[i] =
> +			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (vb->balloon_page_chunk.page_bmap[i])
> +			allocated_bmap_num++;
> +		else
> +			break;
> +	}
> +
> +	return allocated_bmap_num;
> +}
> +
> +static void free_extended_page_bmap(struct virtio_balloon *vb,
> +				    unsigned int page_bmap_num)
> +{
> +	unsigned int i;
> +
> +	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < page_bmap_num;
> +	     i++) {
> +		kfree(vb->balloon_page_chunk.page_bmap[i]);
> +		vb->balloon_page_chunk.page_bmap[i] = NULL;
> +		page_bmap_num--;
> +	}
> +}
> +
> +static void clear_page_bmap(struct virtio_balloon *vb,
> +			    unsigned int page_bmap_num)
> +{
> +	int i;
> +
> +	for (i = 0; i < page_bmap_num; i++)
> +		memset(vb->balloon_page_chunk.page_bmap[i], 0,
> +		       VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +}
> +
> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	unsigned int len, num;
> +	struct vring_desc *desc = vb->balloon_page_chunk.desc_table;
> +
> +	num = vb->balloon_page_chunk.chunk_num;
> +	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		vb->balloon_page_chunk.chunk_num = 0;
> +	}
> +}
> +
> +/* Add a chunk to the buffer. */
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  u64 base_addr, u32 size)
> +{
> +	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
> +	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
> +
> +	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
> +	desc->len = cpu_to_virtio32(vb->vdev, size);
> +	*num += 1;
> +	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq);
> +}
> +

Poking at virtio internals like this is not nice. Pls move to virtio
code.  Also, pages must be read descriptors as host might modify them.

This also lacks viommu support but this is not mandatory as
that is borken atm anyway. I'll send a patch to at least fail cleanly.

> +static void convert_bmap_to_chunks(struct virtio_balloon *vb,
> +				   struct virtqueue *vq,
> +				   unsigned long *bmap,
> +				   unsigned long pfn_start,
> +				   unsigned long size)
> +{
> +	unsigned long next_one, next_zero, pos = 0;
> +	u64 chunk_base_addr;
> +	u32 chunk_size;
> +
> +	while (pos < size) {
> +		next_one = find_next_bit(bmap, size, pos);
> +		/*
> +		 * No "1" bit found, which means that there is no pfn
> +		 * recorded in the rest of this bmap.
> +		 */
> +		if (next_one == size)
> +			break;
> +		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
> +		/*
> +		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
> +		 * Convert it to be pages of 4KB balloon page size when
> +		 * adding it to a chunk.

This looks wrong. add_one_chunk assumes size in bytes. So should be just
PAGE_SIZE.

> +		 */
> +		chunk_size = (next_zero - next_one) *
> +			     VIRTIO_BALLOON_PAGES_PER_PAGE;

How do you know this won't overflow a 32 bit integer? Needs a comment.

> +		chunk_base_addr = (pfn_start + next_one) <<
> +				  VIRTIO_BALLOON_PFN_SHIFT;

Same here I think we've left pfns behind, we are using standard s/g now.

> +		if (chunk_size) {
> +			add_one_chunk(vb, vq, chunk_base_addr, chunk_size);
> +			pos += next_zero + 1;
> +		}
> +	}
> +}
> +
>  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  {
>  	struct scatterlist sg;
> @@ -124,7 +283,35 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  
>  	/* When host has read buffer, this completes via balloon_ack */
>  	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +}
> +
> +static void tell_host_from_page_bmap(struct virtio_balloon *vb,
> +				     struct virtqueue *vq,
> +				     unsigned long pfn_start,
> +				     unsigned long pfn_end,
> +				     unsigned int page_bmap_num)
> +{
> +	unsigned long i, pfn_num;
>  
> +	for (i = 0; i < page_bmap_num; i++) {
> +		/*
> +		 * For the last page_bmap, only the remaining number of pfns
> +		 * need to be searched rather than the entire page_bmap.
> +		 */
> +		if (i + 1 == page_bmap_num)
> +			pfn_num = (pfn_end - pfn_start) %
> +				  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +		else
> +			pfn_num = VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +
> +		convert_bmap_to_chunks(vb, vq,
> +				       vb->balloon_page_chunk.page_bmap[i],
> +				       pfn_start +
> +				       i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP,
> +				       pfn_num);
> +	}
> +	if (vb->balloon_page_chunk.chunk_num > 0)
> +		send_page_chunks(vb, vq);
>  }
>  
>  static void set_page_pfns(struct virtio_balloon *vb,
> @@ -141,13 +328,89 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +/*
> + * Send ballooned pages in chunks to host.
> + * The ballooned pages are recorded in page bitmaps. Each bit in a bitmap
> + * corresponds to a page of PAGE_SIZE. The page bitmaps are searched for
> + * continuous "1" bits, which correspond to continuous pages, to chunk.
> + * When packing those continuous pages into chunks, pages are converted into
> + * 4KB balloon pages.
> + *
> + * pfn_max and pfn_min form the range of pfns that need to use page bitmaps to
> + * record. If the range is too large to be recorded into the allocated page
> + * bitmaps, the page bitmaps are used multiple times to record the entire
> + * range of pfns.
> + */
> +static void tell_host_page_chunks(struct virtio_balloon *vb,
> +				  struct list_head *pages,
> +				  struct virtqueue *vq,
> +				  unsigned long pfn_max,
> +				  unsigned long pfn_min)
> +{
> +	/*
> +	 * The pfn_start and pfn_end form the range of pfns that the allocated
> +	 * page_bmap can record in each round.
> +	 */
> +	unsigned long pfn_start, pfn_end;
> +	/* Total number of allocated page_bmap */
> +	unsigned int page_bmap_num;
> +	struct page *page;
> +	bool found;
> +
> +	/*
> +	 * In the case that one page_bmap is not sufficient to record the pfn
> +	 * range, page_bmap will be extended by allocating more numbers of
> +	 * page_bmap.
> +	 */
> +	page_bmap_num = extend_page_bmap_size(vb, pfn_max - pfn_min + 1);
> +
> +	/* Start from the beginning of the whole pfn range */
> +	pfn_start = pfn_min;
> +	while (pfn_start < pfn_max) {
> +		pfn_end = pfn_start +
> +			  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP * page_bmap_num;
> +		pfn_end = pfn_end < pfn_max ? pfn_end : pfn_max;
> +		clear_page_bmap(vb, page_bmap_num);
> +		found = false;
> +
> +		list_for_each_entry(page, pages, lru) {
> +			unsigned long bmap_idx, bmap_pos, this_pfn;
> +
> +			this_pfn = page_to_pfn(page);
> +			if (this_pfn < pfn_start || this_pfn > pfn_end)
> +				continue;
> +			bmap_idx = (this_pfn - pfn_start) /
> +				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +			bmap_pos = (this_pfn - pfn_start) %
> +				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +			set_bit(bmap_pos,
> +				vb->balloon_page_chunk.page_bmap[bmap_idx]);
> +
> +			found = true;
> +		}
> +		if (found)
> +			tell_host_from_page_bmap(vb, vq, pfn_start, pfn_end,
> +						 page_bmap_num);
> +		/*
> +		 * Start the next round when pfn_start and pfn_end couldn't
> +		 * cover the whole pfn range given by pfn_max and pfn_min.
> +		 */
> +		pfn_start = pfn_end;
> +	}
> +	free_extended_page_bmap(vb, page_bmap_num);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!chunking)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +425,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +437,14 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			tell_host_page_chunks(vb, &vb_dev_info->pages,
> +					      vb->inflate_vq,
> +					      pfn_max, pfn_min);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +470,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!chunking)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +486,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +500,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			tell_host_page_chunks(vb, &pages, vb->deflate_vq,
> +					      pfn_max, pfn_min);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -442,6 +726,14 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
> +		      VIRTIO_BALLOON_PAGES_PER_PAGE);
> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -465,6 +757,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
>  	unsigned long flags;
>  
>  	/*
> @@ -486,16 +780,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -522,9 +822,78 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static void free_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
> +		kfree(vb->balloon_page_chunk.page_bmap[i]);
> +		vb->balloon_page_chunk.page_bmap[i] = NULL;
> +	}
> +}
> +
> +static int balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
> +						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
> +						GFP_KERNEL);

This one's problematic: you aren't supposed to use APIs when the device
is not inited yet. It seems to work by luck here. I suggest moving
this to probe; that's where we do a bunch of inits.
And then you can move the private init back to allocate too.
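
E.g. (sketch) in probe, after vb is allocated:

	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
		err = balloon_page_chunk_init(vb);
		if (err)
			goto out_free_vb;
	}

and drop the allocation work from .validate.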

> +	if (!vb->balloon_page_chunk.desc_table)
> +		goto err_page_chunk;
> +	vb->balloon_page_chunk.chunk_num = 0;
> +
> +	/*
> +	 * The default number of page_bmaps are allocated. More may be
> +	 * allocated on demand.
> +	 */
> +	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
> +		vb->balloon_page_chunk.page_bmap[i] =
> +			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (!vb->balloon_page_chunk.page_bmap[i])
> +			goto err_page_bmap;
> +	}
> +
> +	return 0;
> +err_page_bmap:
> +	free_page_bmap(vb);
> +	kfree(vb->balloon_page_chunk.desc_table);
> +	vb->balloon_page_chunk.desc_table = NULL;
> +err_page_chunk:
> +	__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	return -ENOMEM;
> +}
> +
> +static int virtballoon_validate(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = NULL;
> +	int err;
> +
> +	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> +	if (!vb) {
> +		err = -ENOMEM;
> +		goto err_vb;
> +	}
> +	vb->vdev = vdev;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
> +		err = balloon_page_chunk_init(vb);
> +		if (err < 0)
> +			goto err_page_chunk;
> +	}
> +
> +	return 0;
> +
> +err_page_chunk:
> +	kfree(vb);
> +err_vb:
> +	return err;
> +}
> +

So here you are supposed to validate features, not handle OOM
conditions.  BTW we need a fix for vIOMMU - I noticed balloon does not
support that yet.

>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
> -	struct virtio_balloon *vb;
> +	struct virtio_balloon *vb = vdev->priv;
>  	int err;
>  
>  	if (!vdev->config->get) {
> @@ -533,20 +902,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		return -EINVAL;
>  	}
>  
> -	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> -	if (!vb) {
> -		err = -ENOMEM;
> -		goto out;
> -	}
> -
>  	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
>  	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
> -	vb->vdev = vdev;
>  
>  	balloon_devinfo_init(&vb->vb_dev_info);
>  
> @@ -590,7 +953,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	vdev->config->del_vqs(vdev);
>  out_free_vb:
>  	kfree(vb);
> -out:
>  	return err;
>  }
>  
> @@ -620,6 +982,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	free_page_bmap(vb);
> +	kfree(vb->balloon_page_chunk.desc_table);
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
> @@ -664,6 +1028,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_PAGE_CHUNKS,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> @@ -674,6 +1039,7 @@ static struct virtio_driver virtio_balloon_driver = {
>  	.id_table =	id_table,
>  	.probe =	virtballoon_probe,
>  	.remove =	virtballoon_remove,
> +	.validate =	virtballoon_validate,
>  	.config_changed = virtballoon_changed,
>  #ifdef CONFIG_PM_SLEEP
>  	.freeze	=	virtballoon_freeze,
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 409aeaa..0ea2512 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -235,8 +235,17 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
>  	return dma_mapping_error(vring_dma_dev(vq), addr);
>  }
>  
> -static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> -					 unsigned int total_sg, gfp_t gfp)
> +/**
> + * alloc_indirect - allocate an indirect desc table
> + * @vdev: the virtio_device that owns the indirect desc table.
> + * @num: the number of entries that the table will have.
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Return NULL if the table allocation failed. Otherwise, return the address
> + * of the table.
> + */
> +struct vring_desc *alloc_indirect(struct virtio_device *vdev, unsigned int num,
> +				  gfp_t gfp)
>  {
>  	struct vring_desc *desc;
>  	unsigned int i;
> @@ -248,14 +257,15 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
>  	 */
>  	gfp &= ~__GFP_HIGHMEM;
>  
> -	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> +	desc = kmalloc_array(num, sizeof(struct vring_desc), gfp);
>  	if (!desc)
>  		return NULL;
>  
> -	for (i = 0; i < total_sg; i++)
> -		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> +	for (i = 0; i < num; i++)
> +		desc[i].next = cpu_to_virtio16(vdev, i + 1);
>  	return desc;
>  }
> +EXPORT_SYMBOL_GPL(alloc_indirect);
>  
>  static inline int virtqueue_add(struct virtqueue *_vq,
>  				struct scatterlist *sgs[],
> @@ -302,7 +312,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
>  	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
> -		desc = alloc_indirect(_vq, total_sg, gfp);
> +		desc = alloc_indirect(_vq->vdev, total_sg, gfp);
>  	else
>  		desc = NULL;
>  
> @@ -433,6 +443,104 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  }
>  
>  /**
> + * virtqueue_indirect_desc_table_add - add an indirect desc table to the vq
> + * @_vq: the struct virtqueue we're talking about.
> + * @desc: the desc table we're talking about.
> + * @num: the number of entries that the desc table has.
> + *
> + * Returns zero or a negative error (ie. ENOSPC, EIO).
> + */
> +int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
> +				      struct vring_desc *desc,
> +				      unsigned int num)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	dma_addr_t desc_addr;
> +	unsigned int i, avail;
> +	int head;
> +
> +	/* Sanity check */
> +	if (!desc) {
> +		pr_debug("%s: empty desc table\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	START_USE(vq);
> +
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	if (!vq->vq.num_free) {
> +		pr_debug("%s: the virtioqueue is full\n", __func__);
> +		END_USE(vq);
> +		return -ENOSPC;
> +	}
> +
> +	/* Map and fill in the indirect table */
> +	desc_addr = vring_map_single(vq, desc, num * sizeof(struct vring_desc),
> +				     DMA_TO_DEVICE);
> +	if (vring_mapping_error(vq, desc_addr)) {
> +		pr_debug("%s: map desc failed\n", __func__);
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	/* Mark the flag of the table entries */
> +	for (i = 0; i < num; i++)
> +		desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT);
> +	/* The last one doesn't continue. */
> +	desc[num - 1].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> +
> +	/* Get a ring entry to point to the indirect table */
> +	head = vq->free_head;
> +	vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev,
> +						     VRING_DESC_F_INDIRECT);
> +	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, desc_addr);
> +	vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, num *
> +						   sizeof(struct vring_desc));
> +	/* We're using 1 buffers from the free list. */
> +	vq->vq.num_free--;
> +	/* Update free pointer */
> +	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next);
> +
> +	/* Store token and indirect buffer state. */
> +	vq->desc_state[head].data = desc;
> +	/* Don't free the caller allocated indirect table when detach_buf. */
> +	vq->desc_state[head].indir_desc = NULL;
> +
> +	/*
> +	 * Put entry in available array (but don't update avail->idx until they
> +	 * do sync).
> +	 */
> +	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
> +	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> +
> +	/*
> +	 * Descriptors and available array need to be set before we expose the
> +	 * new available array entries.
> +	 */
> +	virtio_wmb(vq->weak_barriers);
> +	vq->avail_idx_shadow++;
> +	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> +	vq->num_added++;
> +
> +	pr_debug("%s: added buffer head %i to %p\n", __func__, head, vq);
> +	END_USE(vq);
> +
> +	/*
> +	 * This is very unlikely, but theoretically possible.  Kick
> +	 * just in case.
> +	 */
> +	if (unlikely(vq->num_added == (1 << 16) - 1))
> +		virtqueue_kick(_vq);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_indirect_desc_table_add);
> +

I'm not really happy with the fact we are duplicating so much code. Most
of this is duplicated from virtqueue_add, isn't it? I imagine you just
need to factor out the code from the following place down:

        /* If the host supports indirect descriptor tables, and we have multiple
         * buffers, then go indirect. FIXME: tune this threshold */
        if (vq->indirect && total_sg > 1 && vq->vq.num_free)
                desc = alloc_indirect(_vq, total_sg, gfp);
        else
                desc = NULL;

then reuse that.
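
Roughly (a sketch only; the helper name and the exact split are invented
here), the shared part could look like this, with the body condensed from
the code in this patch so that both virtqueue_add() and the new API call
it instead of duplicating the ring bookkeeping:

	static void vring_publish_indirect(struct vring_virtqueue *vq,
					   dma_addr_t desc_addr,
					   unsigned int num,
					   void *data, void *indir_desc)
	{
		struct virtio_device *vdev = vq->vq.vdev;
		unsigned int avail;
		int head = vq->free_head;

		/* Point one ring entry at the already-mapped indirect table */
		vq->vring.desc[head].flags =
			cpu_to_virtio16(vdev, VRING_DESC_F_INDIRECT);
		vq->vring.desc[head].addr = cpu_to_virtio64(vdev, desc_addr);
		vq->vring.desc[head].len =
			cpu_to_virtio32(vdev, num * sizeof(struct vring_desc));

		vq->vq.num_free--;
		vq->free_head = virtio16_to_cpu(vdev, vq->vring.desc[head].next);

		vq->desc_state[head].data = data;
		vq->desc_state[head].indir_desc = indir_desc;

		/* Publish the new available entry */
		avail = vq->avail_idx_shadow & (vq->vring.num - 1);
		vq->vring.avail->ring[avail] = cpu_to_virtio16(vdev, head);

		virtio_wmb(vq->weak_barriers);
		vq->avail_idx_shadow++;
		vq->vring.avail->idx =
			cpu_to_virtio16(vdev, vq->avail_idx_shadow);
		vq->num_added++;
	}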

> +/**
>   * virtqueue_add_sgs - expose buffers to other end
>   * @vq: the struct virtqueue we're talking about.
>   * @sgs: array of terminated scatterlists.
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 7edfbdb..01dad22 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -34,6 +34,13 @@ struct virtqueue {
>  	void *priv;
>  };
>  
> +struct vring_desc *alloc_indirect(struct virtio_device *vdev,
> +				  unsigned int num, gfp_t gfp);
> +

Please prefix with virtqueue or virtio (depending on 1st parameter).
You also want a free API to pair with this (even though it's just kfree
right now).

> +int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
> +				      struct vring_desc *desc,
> +				      unsigned int num);
> +
>  int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 struct scatterlist sg[], unsigned int num,
>  			 void *data,
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..5ed3c7b 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_PAGE_CHUNKS	3 /* Inflate/Deflate pages in chunks */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> index c072959..0499fb8 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -111,6 +111,9 @@ struct vring {
>  #define VRING_USED_ALIGN_SIZE 4
>  #define VRING_DESC_ALIGN_SIZE 16
>  
> +/* The supported max queue size */
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
>  /* The standard layout for the ring is a continuous chunk of memory which looks
>   * like this.  We assume num is a power of 2.
>   *

Please do not add this to UAPI.

> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
@ 2017-06-13 17:56     ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-13 17:56 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Matthew Wilcox

On Fri, Jun 09, 2017 at 06:41:38PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
> the transfer of the ballooned (i.e. inflated/deflated) pages in
> chunks to the host.

So now these chunks are just s/g list entries.
Let's rename this to VIRTIO_BALLOON_F_SG with a comment:
* Use standard virtio s/g instead of PFN lists *

> The implementation of the previous virtio-balloon is not very
> efficient, because the ballooned pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically continuous pages.
> When the pages are packed into a chunk, they are converted into
> balloon page size (4KB) pages. A chunk is offered to the host
> via a base address (i.e. the start guest physical address of those
> physically continuous pages) and the size (i.e. the total number
> of the 4KB balloon size pages). A chunk is described via a
> vring_desc struct in the implementation.
> 
> By doing so, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~590ms
> resulting in an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 418 +++++++++++++++++++++++++++++++++---
>  drivers/virtio/virtio_ring.c        | 120 ++++++++++-
>  include/linux/virtio.h              |   7 +
>  include/uapi/linux/virtio_balloon.h |   1 +
>  include/uapi/linux/virtio_ring.h    |   3 +
>  5 files changed, 517 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index ecb64e9..0cf945c 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -51,6 +51,36 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +/* The size of one page_bmap used to record inflated/deflated pages. */
> +#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)

At this size, you probably want alloc_pages to avoid kmalloc
overhead.

> +/*
> + * Callulates how many pfns can a page_bmap record. A bit corresponds to a
> + * page of PAGE_SIZE.
> + */
> +#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
> +	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +
> +/* The number of page_bmap to allocate by default. */
> +#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1

It's not by default, it's at probe time, right?

> +/* The maximum number of page_bmap that can be allocated. */

Not really, this is the size of the array we use to keep them.

> +#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
> +

So you still have a home-grown bitmap. I'd like to know why the
xbitmap that Matthew Wilcox suggested for this purpose isn't
appropriate. Please add a comment explaining the requirements
on the data structure.

> +/*
> + * QEMU virtio implementation requires the desc table size less than
> + * VIRTQUEUE_MAX_SIZE, so minus 1 here.

I think it doesn't, the issue is probably that you add a header
as a separate s/g. In any case see below.

> + */
> +#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)

This is wrong, virtio spec says s/g size should not exceed VQ size.
If you want to support huge VQ sizes, you can add a fallback to
smaller sizes until it fits in 1 page.

> +
> +/* The struct to manage ballooned pages in chunks */
> +struct virtio_balloon_page_chunk {
> +	/* Indirect desc table to hold chunks of balloon pages */
> +	struct vring_desc *desc_table;
> +	/* Number of added chunks of balloon pages */
> +	unsigned int chunk_num;
> +	/* Bitmap used to record ballooned pages. */
> +	unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
> +};
> +
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -79,6 +109,8 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	struct virtio_balloon_page_chunk balloon_page_chunk;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -111,6 +143,133 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> +/* Update pfn_max and pfn_min according to the pfn of page */
> +static inline void update_pfn_range(struct virtio_balloon *vb,
> +				    struct page *page,
> +				    unsigned long *pfn_min,
> +				    unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +}
> +
> +static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
> +					  unsigned long pfn_num)

What's this API doing? Please add comments. This seems to assume
it will only be called once. It would be better to avoid making
that assumption: just look at what has been allocated
and extend it.
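
A possible shape of that, assuming the existing page_bmap[] array
(sketch only):

        unsigned int allocated = 0;

        /* count what is already there instead of assuming the
         * probe-time default, then extend from that point */
        while (allocated < VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM &&
               vb->balloon_page_chunk.page_bmap[allocated])
                allocated++;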

> +{
> +	unsigned int i, bmap_num, allocated_bmap_num;
> +	unsigned long bmap_len;
> +
> +	allocated_bmap_num = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM;

How come? Please init vars where they are declared.

> +	bmap_len = ALIGN(pfn_num, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bmap_len = roundup(bmap_len, VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +	/*
> +	 * VIRTIO_BALLOON_PAGE_BMAP_SIZE is the size of one page_bmap, so
> +	 * divide it to calculate how many page_bmap that we need.
> +	 */
> +	bmap_num = (unsigned int)(bmap_len / VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +	/* The number of page_bmap to allocate should not exceed the max */
> +	bmap_num = min_t(unsigned int, VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM,
> +			 bmap_num);

The two comments above don't really help; just drop them.

> +
> +	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < bmap_num; i++) {
> +		vb->balloon_page_chunk.page_bmap[i] =
> +			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (vb->balloon_page_chunk.page_bmap[i])
> +			allocated_bmap_num++;
> +		else
> +			break;
> +	}
> +
> +	return allocated_bmap_num;
> +}
> +
> +static void free_extended_page_bmap(struct virtio_balloon *vb,
> +				    unsigned int page_bmap_num)
> +{
> +	unsigned int i;
> +
> +	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < page_bmap_num;
> +	     i++) {
> +		kfree(vb->balloon_page_chunk.page_bmap[i]);
> +		vb->balloon_page_chunk.page_bmap[i] = NULL;
> +		page_bmap_num--;
> +	}
> +}
> +
> +static void clear_page_bmap(struct virtio_balloon *vb,
> +			    unsigned int page_bmap_num)
> +{
> +	int i;
> +
> +	for (i = 0; i < page_bmap_num; i++)
> +		memset(vb->balloon_page_chunk.page_bmap[i], 0,
> +		       VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +}
> +
> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	unsigned int len, num;
> +	struct vring_desc *desc = vb->balloon_page_chunk.desc_table;
> +
> +	num = vb->balloon_page_chunk.chunk_num;
> +	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		vb->balloon_page_chunk.chunk_num = 0;
> +	}
> +}
> +
> +/* Add a chunk to the buffer. */
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  u64 base_addr, u32 size)
> +{
> +	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
> +	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
> +
> +	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
> +	desc->len = cpu_to_virtio32(vb->vdev, size);
> +	*num += 1;
> +	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq);
> +}
> +

Poking at virtio internals like this is not nice. Please move it to
virtio code. Also, the pages must be read descriptors, as the host might
modify them.

This also lacks vIOMMU support, but that is not mandatory since vIOMMU
is broken at the moment anyway. I'll send a patch to at least fail cleanly.
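
One plausible way to fail cleanly until that is sorted out (just a guess
at what such a patch could do, e.g. from probe()):

        if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM)) {
                dev_warn(&vdev->dev,
                         "virtio-balloon does not support a vIOMMU yet\n");
                return -EINVAL;
        }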

> +static void convert_bmap_to_chunks(struct virtio_balloon *vb,
> +				   struct virtqueue *vq,
> +				   unsigned long *bmap,
> +				   unsigned long pfn_start,
> +				   unsigned long size)
> +{
> +	unsigned long next_one, next_zero, pos = 0;
> +	u64 chunk_base_addr;
> +	u32 chunk_size;
> +
> +	while (pos < size) {
> +		next_one = find_next_bit(bmap, size, pos);
> +		/*
> +		 * No "1" bit found, which means that there is no pfn
> +		 * recorded in the rest of this bmap.
> +		 */
> +		if (next_one == size)
> +			break;
> +		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
> +		/*
> +		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
> +		 * Convert it to be pages of 4KB balloon page size when
> +		 * adding it to a chunk.

This looks wrong: add_one_chunk assumes a size in bytes, so this should
just be PAGE_SIZE.
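
In other words, something like the following (a sketch of the byte-sized
variant, not the actual fix):

                /* length in bytes, matching what desc->len expects */
                chunk_size = (next_zero - next_one) * PAGE_SIZE;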

> +		 */
> +		chunk_size = (next_zero - next_one) *
> +			     VIRTIO_BALLOON_PAGES_PER_PAGE;

How do you know this won't overflow a 32-bit integer? This needs a comment.

> +		chunk_base_addr = (pfn_start + next_one) <<
> +				  VIRTIO_BALLOON_PFN_SHIFT;

Same here: I think we've left pfns behind; we are using standard s/g now.

> +		if (chunk_size) {
> +			add_one_chunk(vb, vq, chunk_base_addr, chunk_size);
> +			pos += next_zero + 1;
> +		}
> +	}
> +}
> +
>  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  {
>  	struct scatterlist sg;
> @@ -124,7 +283,35 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  
>  	/* When host has read buffer, this completes via balloon_ack */
>  	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +}
> +
> +static void tell_host_from_page_bmap(struct virtio_balloon *vb,
> +				     struct virtqueue *vq,
> +				     unsigned long pfn_start,
> +				     unsigned long pfn_end,
> +				     unsigned int page_bmap_num)
> +{
> +	unsigned long i, pfn_num;
>  
> +	for (i = 0; i < page_bmap_num; i++) {
> +		/*
> +		 * For the last page_bmap, only the remaining number of pfns
> +		 * need to be searched rather than the entire page_bmap.
> +		 */
> +		if (i + 1 == page_bmap_num)
> +			pfn_num = (pfn_end - pfn_start) %
> +				  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +		else
> +			pfn_num = VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +
> +		convert_bmap_to_chunks(vb, vq,
> +				       vb->balloon_page_chunk.page_bmap[i],
> +				       pfn_start +
> +				       i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP,
> +				       pfn_num);
> +	}
> +	if (vb->balloon_page_chunk.chunk_num > 0)
> +		send_page_chunks(vb, vq);
>  }
>  
>  static void set_page_pfns(struct virtio_balloon *vb,
> @@ -141,13 +328,89 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +/*
> + * Send ballooned pages in chunks to host.
> + * The ballooned pages are recorded in page bitmaps. Each bit in a bitmap
> + * corresponds to a page of PAGE_SIZE. The page bitmaps are searched for
> + * continuous "1" bits, which correspond to continuous pages, to chunk.
> + * When packing those continuous pages into chunks, pages are converted into
> + * 4KB balloon pages.
> + *
> + * pfn_max and pfn_min form the range of pfns that need to use page bitmaps to
> + * record. If the range is too large to be recorded into the allocated page
> + * bitmaps, the page bitmaps are used multiple times to record the entire
> + * range of pfns.
> + */
> +static void tell_host_page_chunks(struct virtio_balloon *vb,
> +				  struct list_head *pages,
> +				  struct virtqueue *vq,
> +				  unsigned long pfn_max,
> +				  unsigned long pfn_min)
> +{
> +	/*
> +	 * The pfn_start and pfn_end form the range of pfns that the allocated
> +	 * page_bmap can record in each round.
> +	 */
> +	unsigned long pfn_start, pfn_end;
> +	/* Total number of allocated page_bmap */
> +	unsigned int page_bmap_num;
> +	struct page *page;
> +	bool found;
> +
> +	/*
> +	 * In the case that one page_bmap is not sufficient to record the pfn
> +	 * range, page_bmap will be extended by allocating more numbers of
> +	 * page_bmap.
> +	 */
> +	page_bmap_num = extend_page_bmap_size(vb, pfn_max - pfn_min + 1);
> +
> +	/* Start from the beginning of the whole pfn range */
> +	pfn_start = pfn_min;
> +	while (pfn_start < pfn_max) {
> +		pfn_end = pfn_start +
> +			  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP * page_bmap_num;
> +		pfn_end = pfn_end < pfn_max ? pfn_end : pfn_max;
> +		clear_page_bmap(vb, page_bmap_num);
> +		found = false;
> +
> +		list_for_each_entry(page, pages, lru) {
> +			unsigned long bmap_idx, bmap_pos, this_pfn;
> +
> +			this_pfn = page_to_pfn(page);
> +			if (this_pfn < pfn_start || this_pfn > pfn_end)
> +				continue;
> +			bmap_idx = (this_pfn - pfn_start) /
> +				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +			bmap_pos = (this_pfn - pfn_start) %
> +				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +			set_bit(bmap_pos,
> +				vb->balloon_page_chunk.page_bmap[bmap_idx]);
> +
> +			found = true;
> +		}
> +		if (found)
> +			tell_host_from_page_bmap(vb, vq, pfn_start, pfn_end,
> +						 page_bmap_num);
> +		/*
> +		 * Start the next round when pfn_start and pfn_end couldn't
> +		 * cover the whole pfn range given by pfn_max and pfn_min.
> +		 */
> +		pfn_start = pfn_end;
> +	}
> +	free_extended_page_bmap(vb, page_bmap_num);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!chunking)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +425,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +437,14 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			tell_host_page_chunks(vb, &vb_dev_info->pages,
> +					      vb->inflate_vq,
> +					      pfn_max, pfn_min);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +470,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!chunking)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +486,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +500,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			tell_host_page_chunks(vb, &pages, vb->deflate_vq,
> +					      pfn_max, pfn_min);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -442,6 +726,14 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
> +		      VIRTIO_BALLOON_PAGES_PER_PAGE);
> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -465,6 +757,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
>  	unsigned long flags;
>  
>  	/*
> @@ -486,16 +780,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -522,9 +822,78 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static void free_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
> +		kfree(vb->balloon_page_chunk.page_bmap[i]);
> +		vb->balloon_page_chunk.page_bmap[i] = NULL;
> +	}
> +}
> +
> +static int balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
> +						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
> +						GFP_KERNEL);

This one's problematic: you aren't supposed to use APIs while the device
is not initialized yet. It seems to work by luck here. I suggest moving
this to probe, which is where we do a bunch of inits.
And then you can move the private init back to the allocation path too.
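
For example, the probe-time call could look roughly like this (reusing
the patch's balloon_page_chunk_init() and probe's existing error label;
a sketch, not a tested change):

        if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
                err = balloon_page_chunk_init(vb);
                if (err)
                        goto out_free_vb;
        }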

> +	if (!vb->balloon_page_chunk.desc_table)
> +		goto err_page_chunk;
> +	vb->balloon_page_chunk.chunk_num = 0;
> +
> +	/*
> +	 * The default number of page_bmaps are allocated. More may be
> +	 * allocated on demand.
> +	 */
> +	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
> +		vb->balloon_page_chunk.page_bmap[i] =
> +			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (!vb->balloon_page_chunk.page_bmap[i])
> +			goto err_page_bmap;
> +	}
> +
> +	return 0;
> +err_page_bmap:
> +	free_page_bmap(vb);
> +	kfree(vb->balloon_page_chunk.desc_table);
> +	vb->balloon_page_chunk.desc_table = NULL;
> +err_page_chunk:
> +	__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	return -ENOMEM;
> +}
> +
> +static int virtballoon_validate(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = NULL;
> +	int err;
> +
> +	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> +	if (!vb) {
> +		err = -ENOMEM;
> +		goto err_vb;
> +	}
> +	vb->vdev = vdev;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
> +		err = balloon_page_chunk_init(vb);
> +		if (err < 0)
> +			goto err_page_chunk;
> +	}
> +
> +	return 0;
> +
> +err_page_chunk:
> +	kfree(vb);
> +err_vb:
> +	return err;
> +}
> +

So here you are supposed to validate features, not handle OOM
conditions. BTW, we need a fix for vIOMMU - I noticed balloon does not
support that yet.

>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
> -	struct virtio_balloon *vb;
> +	struct virtio_balloon *vb = vdev->priv;
>  	int err;
>  
>  	if (!vdev->config->get) {
> @@ -533,20 +902,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		return -EINVAL;
>  	}
>  
> -	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> -	if (!vb) {
> -		err = -ENOMEM;
> -		goto out;
> -	}
> -
>  	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
>  	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
> -	vb->vdev = vdev;
>  
>  	balloon_devinfo_init(&vb->vb_dev_info);
>  
> @@ -590,7 +953,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	vdev->config->del_vqs(vdev);
>  out_free_vb:
>  	kfree(vb);
> -out:
>  	return err;
>  }
>  
> @@ -620,6 +982,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	free_page_bmap(vb);
> +	kfree(vb->balloon_page_chunk.desc_table);
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
> @@ -664,6 +1028,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_PAGE_CHUNKS,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> @@ -674,6 +1039,7 @@ static struct virtio_driver virtio_balloon_driver = {
>  	.id_table =	id_table,
>  	.probe =	virtballoon_probe,
>  	.remove =	virtballoon_remove,
> +	.validate =	virtballoon_validate,
>  	.config_changed = virtballoon_changed,
>  #ifdef CONFIG_PM_SLEEP
>  	.freeze	=	virtballoon_freeze,
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 409aeaa..0ea2512 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -235,8 +235,17 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
>  	return dma_mapping_error(vring_dma_dev(vq), addr);
>  }
>  
> -static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> -					 unsigned int total_sg, gfp_t gfp)
> +/**
> + * alloc_indirect - allocate an indirect desc table
> + * @vdev: the virtio_device that owns the indirect desc table.
> + * @num: the number of entries that the table will have.
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Return NULL if the table allocation failed. Otherwise, return the address
> + * of the table.
> + */
> +struct vring_desc *alloc_indirect(struct virtio_device *vdev, unsigned int num,
> +				  gfp_t gfp)
>  {
>  	struct vring_desc *desc;
>  	unsigned int i;
> @@ -248,14 +257,15 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
>  	 */
>  	gfp &= ~__GFP_HIGHMEM;
>  
> -	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> +	desc = kmalloc_array(num, sizeof(struct vring_desc), gfp);
>  	if (!desc)
>  		return NULL;
>  
> -	for (i = 0; i < total_sg; i++)
> -		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> +	for (i = 0; i < num; i++)
> +		desc[i].next = cpu_to_virtio16(vdev, i + 1);
>  	return desc;
>  }
> +EXPORT_SYMBOL_GPL(alloc_indirect);
>  
>  static inline int virtqueue_add(struct virtqueue *_vq,
>  				struct scatterlist *sgs[],
> @@ -302,7 +312,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
>  	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
> -		desc = alloc_indirect(_vq, total_sg, gfp);
> +		desc = alloc_indirect(_vq->vdev, total_sg, gfp);
>  	else
>  		desc = NULL;
>  
> @@ -433,6 +443,104 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  }
>  
>  /**
> + * virtqueue_indirect_desc_table_add - add an indirect desc table to the vq
> + * @_vq: the struct virtqueue we're talking about.
> + * @desc: the desc table we're talking about.
> + * @num: the number of entries that the desc table has.
> + *
> + * Returns zero or a negative error (ie. ENOSPC, EIO).
> + */
> +int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
> +				      struct vring_desc *desc,
> +				      unsigned int num)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	dma_addr_t desc_addr;
> +	unsigned int i, avail;
> +	int head;
> +
> +	/* Sanity check */
> +	if (!desc) {
> +		pr_debug("%s: empty desc table\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	START_USE(vq);
> +
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	if (!vq->vq.num_free) {
> +		pr_debug("%s: the virtioqueue is full\n", __func__);
> +		END_USE(vq);
> +		return -ENOSPC;
> +	}
> +
> +	/* Map and fill in the indirect table */
> +	desc_addr = vring_map_single(vq, desc, num * sizeof(struct vring_desc),
> +				     DMA_TO_DEVICE);
> +	if (vring_mapping_error(vq, desc_addr)) {
> +		pr_debug("%s: map desc failed\n", __func__);
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	/* Mark the flag of the table entries */
> +	for (i = 0; i < num; i++)
> +		desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT);
> +	/* The last one doesn't continue. */
> +	desc[num - 1].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> +
> +	/* Get a ring entry to point to the indirect table */
> +	head = vq->free_head;
> +	vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev,
> +						     VRING_DESC_F_INDIRECT);
> +	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, desc_addr);
> +	vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, num *
> +						   sizeof(struct vring_desc));
> +	/* We're using 1 buffers from the free list. */
> +	vq->vq.num_free--;
> +	/* Update free pointer */
> +	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next);
> +
> +	/* Store token and indirect buffer state. */
> +	vq->desc_state[head].data = desc;
> +	/* Don't free the caller allocated indirect table when detach_buf. */
> +	vq->desc_state[head].indir_desc = NULL;
> +
> +	/*
> +	 * Put entry in available array (but don't update avail->idx until they
> +	 * do sync).
> +	 */
> +	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
> +	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> +
> +	/*
> +	 * Descriptors and available array need to be set before we expose the
> +	 * new available array entries.
> +	 */
> +	virtio_wmb(vq->weak_barriers);
> +	vq->avail_idx_shadow++;
> +	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> +	vq->num_added++;
> +
> +	pr_debug("%s: added buffer head %i to %p\n", __func__, head, vq);
> +	END_USE(vq);
> +
> +	/*
> +	 * This is very unlikely, but theoretically possible.  Kick
> +	 * just in case.
> +	 */
> +	if (unlikely(vq->num_added == (1 << 16) - 1))
> +		virtqueue_kick(_vq);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_indirect_desc_table_add);
> +

I'm not really happy with the fact that we are duplicating so much code.
Most of this is duplicated from virtqueue_add, isn't it? I imagine you
just need to factor out the code from the following place down:

        /* If the host supports indirect descriptor tables, and we have multiple
         * buffers, then go indirect. FIXME: tune this threshold */
        if (vq->indirect && total_sg > 1 && vq->vq.num_free)
                desc = alloc_indirect(_vq, total_sg, gfp);
        else
                desc = NULL;

then reuse that.
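
A rough outline of such a shared helper, derived from the code quoted
above (all names here are invented for illustration):

static int vring_add_indirect_head(struct vring_virtqueue *vq,
                                   dma_addr_t desc_addr,
                                   unsigned int num, void *data)
{
        struct virtio_device *vdev = vq->vq.vdev;
        unsigned int avail;
        int head = vq->free_head;

        /* point one ring descriptor at the (already mapped) table */
        vq->vring.desc[head].flags =
                cpu_to_virtio16(vdev, VRING_DESC_F_INDIRECT);
        vq->vring.desc[head].addr = cpu_to_virtio64(vdev, desc_addr);
        vq->vring.desc[head].len =
                cpu_to_virtio32(vdev, num * sizeof(struct vring_desc));
        vq->vq.num_free--;
        vq->free_head = virtio16_to_cpu(vdev, vq->vring.desc[head].next);
        vq->desc_state[head].data = data;

        /* publish it in the available ring */
        avail = vq->avail_idx_shadow & (vq->vring.num - 1);
        vq->vring.avail->ring[avail] = cpu_to_virtio16(vdev, head);
        virtio_wmb(vq->weak_barriers);
        vq->avail_idx_shadow++;
        vq->vring.avail->idx = cpu_to_virtio16(vdev, vq->avail_idx_shadow);
        vq->num_added++;

        return head;
}

virtqueue_add() and the new entry point would then both call it after
mapping their descriptor tables, so the indirect path exists only once.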

> +/**
>   * virtqueue_add_sgs - expose buffers to other end
>   * @vq: the struct virtqueue we're talking about.
>   * @sgs: array of terminated scatterlists.
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 7edfbdb..01dad22 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -34,6 +34,13 @@ struct virtqueue {
>  	void *priv;
>  };
>  
> +struct vring_desc *alloc_indirect(struct virtio_device *vdev,
> +				  unsigned int num, gfp_t gfp);
> +

Please prefix with virtqueue or virtio (depending on 1st parameter).
You also want a free API to pair with this (even though it's just kfree
right now).
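
i.e. something along these lines (a sketch; the names are only
suggestions):

struct vring_desc *virtio_alloc_indirect(struct virtio_device *vdev,
                                         unsigned int num, gfp_t gfp);

static inline void virtio_free_indirect(struct vring_desc *desc)
{
        kfree(desc);
}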

> +int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
> +				      struct vring_desc *desc,
> +				      unsigned int num);
> +
>  int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 struct scatterlist sg[], unsigned int num,
>  			 void *data,
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..5ed3c7b 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_PAGE_CHUNKS	3 /* Inflate/Deflate pages in chunks */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> index c072959..0499fb8 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -111,6 +111,9 @@ struct vring {
>  #define VRING_USED_ALIGN_SIZE 4
>  #define VRING_DESC_ALIGN_SIZE 16
>  
> +/* The supported max queue size */
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
>  /* The standard layout for the ring is a continuous chunk of memory which looks
>   * like this.  We assume num is a power of 2.
>   *

Please do not add this to UAPI.

> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-09 10:41   ` Wei Wang
  (?)
  (?)
@ 2017-06-13 17:56   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-13 17:56 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, qemu-devel, amit.shah,
	liliang.opensource, dave.hansen, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, cornelia.huck, pbonzini, akpm, mgorman

On Fri, Jun 09, 2017 at 06:41:38PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
> the transfer of the ballooned (i.e. inflated/deflated) pages in
> chunks to the host.

so now these chunks are just s/g list entries.
So let's rename this to VIRTIO_BALLOON_F_SG with a comment:
* Use standard virtio s/g instead of PFN lists *
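
Something like this (hypothetical, just following the existing feature-bit style):

        #define VIRTIO_BALLOON_F_SG	3 /* Use standard virtio s/g instead of PFN lists */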

> The implementation of the previous virtio-balloon is not very
> efficient, because the ballooned pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically continuous pages.
> When the pages are packed into a chunk, they are converted into
> balloon page size (4KB) pages. A chunk is offered to the host
> via a base address (i.e. the start guest physical address of those
> physically continuous pages) and the size (i.e. the total number
> of the 4KB balloon size pages). A chunk is described via a
> vring_desc struct in the implementation.
> 
> By doing so, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~590ms
> resulting in an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 418 +++++++++++++++++++++++++++++++++---
>  drivers/virtio/virtio_ring.c        | 120 ++++++++++-
>  include/linux/virtio.h              |   7 +
>  include/uapi/linux/virtio_balloon.h |   1 +
>  include/uapi/linux/virtio_ring.h    |   3 +
>  5 files changed, 517 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index ecb64e9..0cf945c 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -51,6 +51,36 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +/* The size of one page_bmap used to record inflated/deflated pages. */
> +#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)

At this size, you probably want alloc_pages to avoid kmalloc
overhead.
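
E.g. (untested sketch, keeping the existing size macro):

        bmap = (unsigned long *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
                                get_order(VIRTIO_BALLOON_PAGE_BMAP_SIZE));

and pair it with free_pages() instead of kfree() on the free side.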

> +/*
> + * Calculates how many pfns a page_bmap can record. A bit corresponds to a
> + * page of PAGE_SIZE.
> + */
> +#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
> +	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +
> +/* The number of page_bmap to allocate by default. */
> +#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1

It's not by default, it's at probe time, right?

> +/* The maximum number of page_bmap that can be allocated. */

Not really, this is the size of the array we use to keep them.

> +#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
> +

So you still have a home-grown bitmap. I'd like to know why the
xbitmap suggested for this purpose by Matthew Wilcox isn't
appropriate. Please add a comment explaining the requirements
on the data structure.

> +/*
> + * QEMU virtio implementation requires the desc table size less than
> + * VIRTQUEUE_MAX_SIZE, so minus 1 here.

I think it doesn't; the issue is probably that you add a header
as a separate s/g. In any case, see below.

> + */
> +#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)

This is wrong, virtio spec says s/g size should not exceed VQ size.
If you want to support huge VQ sizes, you can add a fallback to
smaller sizes until it fits in 1 page.
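
E.g. something along these lines (untested sketch):

        /* never exceed the VQ size, and fall back until the table fits in one page */
        max_chunks = min_t(unsigned int, virtqueue_get_vring_size(vq),
                           PAGE_SIZE / sizeof(struct vring_desc));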

> +
> +/* The struct to manage ballooned pages in chunks */
> +struct virtio_balloon_page_chunk {
> +	/* Indirect desc table to hold chunks of balloon pages */
> +	struct vring_desc *desc_table;
> +	/* Number of added chunks of balloon pages */
> +	unsigned int chunk_num;
> +	/* Bitmap used to record ballooned pages. */
> +	unsigned long *page_bmap[VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM];
> +};
> +
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -79,6 +109,8 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	struct virtio_balloon_page_chunk balloon_page_chunk;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -111,6 +143,133 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> +/* Update pfn_max and pfn_min according to the pfn of page */
> +static inline void update_pfn_range(struct virtio_balloon *vb,
> +				    struct page *page,
> +				    unsigned long *pfn_min,
> +				    unsigned long *pfn_max)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*pfn_min = min(pfn, *pfn_min);
> +	*pfn_max = max(pfn, *pfn_max);
> +}
> +
> +static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
> +					  unsigned long pfn_num)

What's this API doing?  Pls add comments. This seems to assume
it will only be called once. It would be better to avoid making
this assumption: just look at what has been allocated
and extend it.
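
I.e. something like (untested, reusing the names from the patch):

        for (i = 0; i < bmap_num && i < VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM; i++) {
                if (vb->balloon_page_chunk.page_bmap[i])
                        continue;	/* already allocated */
                vb->balloon_page_chunk.page_bmap[i] =
                        kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
                if (!vb->balloon_page_chunk.page_bmap[i])
                        break;
        }
        return i;	/* number of usable page_bmaps */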

> +{
> +	unsigned int i, bmap_num, allocated_bmap_num;
> +	unsigned long bmap_len;
> +
> +	allocated_bmap_num = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM;

How come? Pls init vars where they are declared.

> +	bmap_len = ALIGN(pfn_num, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bmap_len = roundup(bmap_len, VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +	/*
> +	 * VIRTIO_BALLOON_PAGE_BMAP_SIZE is the size of one page_bmap, so
> +	 * divide it to calculate how many page_bmap that we need.
> +	 */
> +	bmap_num = (unsigned int)(bmap_len / VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +	/* The number of page_bmap to allocate should not exceed the max */
> +	bmap_num = min_t(unsigned int, VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM,
> +			 bmap_num);

The two comments above don't really help; just drop them.

> +
> +	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < bmap_num; i++) {
> +		vb->balloon_page_chunk.page_bmap[i] =
> +			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (vb->balloon_page_chunk.page_bmap[i])
> +			allocated_bmap_num++;
> +		else
> +			break;
> +	}
> +
> +	return allocated_bmap_num;
> +}
> +
> +static void free_extended_page_bmap(struct virtio_balloon *vb,
> +				    unsigned int page_bmap_num)
> +{
> +	unsigned int i;
> +
> +	for (i = VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i < page_bmap_num;
> +	     i++) {
> +		kfree(vb->balloon_page_chunk.page_bmap[i]);
> +		vb->balloon_page_chunk.page_bmap[i] = NULL;
> +		page_bmap_num--;
> +	}
> +}
> +
> +static void clear_page_bmap(struct virtio_balloon *vb,
> +			    unsigned int page_bmap_num)
> +{
> +	int i;
> +
> +	for (i = 0; i < page_bmap_num; i++)
> +		memset(vb->balloon_page_chunk.page_bmap[i], 0,
> +		       VIRTIO_BALLOON_PAGE_BMAP_SIZE);
> +}
> +
> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	unsigned int len, num;
> +	struct vring_desc *desc = vb->balloon_page_chunk.desc_table;
> +
> +	num = vb->balloon_page_chunk.chunk_num;
> +	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
> +		virtqueue_kick(vq);
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		vb->balloon_page_chunk.chunk_num = 0;
> +	}
> +}
> +
> +/* Add a chunk to the buffer. */
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  u64 base_addr, u32 size)
> +{
> +	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
> +	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
> +
> +	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
> +	desc->len = cpu_to_virtio32(vb->vdev, size);
> +	*num += 1;
> +	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq);
> +}
> +

Poking at virtio internals like this is not nice. Pls move to virtio
code.  Also, pages must be read descriptors as host might modify them.

This also lacks viommu support but this is not mandatory as
that is borken atm anyway. I'll send a patch to at least fail cleanly.

> +static void convert_bmap_to_chunks(struct virtio_balloon *vb,
> +				   struct virtqueue *vq,
> +				   unsigned long *bmap,
> +				   unsigned long pfn_start,
> +				   unsigned long size)
> +{
> +	unsigned long next_one, next_zero, pos = 0;
> +	u64 chunk_base_addr;
> +	u32 chunk_size;
> +
> +	while (pos < size) {
> +		next_one = find_next_bit(bmap, size, pos);
> +		/*
> +		 * No "1" bit found, which means that there is no pfn
> +		 * recorded in the rest of this bmap.
> +		 */
> +		if (next_one == size)
> +			break;
> +		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
> +		/*
> +		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
> +		 * Convert it to be pages of 4KB balloon page size when
> +		 * adding it to a chunk.

This looks wrong. add_one_chunk assumes size in bytes. So should be just
PAGE_SIZE.

> +		 */
> +		chunk_size = (next_zero - next_one) *
> +			     VIRTIO_BALLOON_PAGES_PER_PAGE;

How do you know this won't overflow a 32 bit integer? Needs a comment.

> +		chunk_base_addr = (pfn_start + next_one) <<
> +				  VIRTIO_BALLOON_PFN_SHIFT;

Same here, I think we've left pfns behind; we are using standard s/g now.

> +		if (chunk_size) {
> +			add_one_chunk(vb, vq, chunk_base_addr, chunk_size);
> +			pos += next_zero + 1;
> +		}
> +	}
> +}
> +
>  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  {
>  	struct scatterlist sg;
> @@ -124,7 +283,35 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  
>  	/* When host has read buffer, this completes via balloon_ack */
>  	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +}
> +
> +static void tell_host_from_page_bmap(struct virtio_balloon *vb,
> +				     struct virtqueue *vq,
> +				     unsigned long pfn_start,
> +				     unsigned long pfn_end,
> +				     unsigned int page_bmap_num)
> +{
> +	unsigned long i, pfn_num;
>  
> +	for (i = 0; i < page_bmap_num; i++) {
> +		/*
> +		 * For the last page_bmap, only the remaining number of pfns
> +		 * need to be searched rather than the entire page_bmap.
> +		 */
> +		if (i + 1 == page_bmap_num)
> +			pfn_num = (pfn_end - pfn_start) %
> +				  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +		else
> +			pfn_num = VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +
> +		convert_bmap_to_chunks(vb, vq,
> +				       vb->balloon_page_chunk.page_bmap[i],
> +				       pfn_start +
> +				       i * VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP,
> +				       pfn_num);
> +	}
> +	if (vb->balloon_page_chunk.chunk_num > 0)
> +		send_page_chunks(vb, vq);
>  }
>  
>  static void set_page_pfns(struct virtio_balloon *vb,
> @@ -141,13 +328,89 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  					  page_to_balloon_pfn(page) + i);
>  }
>  
> +/*
> + * Send ballooned pages in chunks to host.
> + * The ballooned pages are recorded in page bitmaps. Each bit in a bitmap
> + * corresponds to a page of PAGE_SIZE. The page bitmaps are searched for
> + * continuous "1" bits, which correspond to continuous pages, to chunk.
> + * When packing those continuous pages into chunks, pages are converted into
> + * 4KB balloon pages.
> + *
> + * pfn_max and pfn_min form the range of pfns that need to use page bitmaps to
> + * record. If the range is too large to be recorded into the allocated page
> + * bitmaps, the page bitmaps are used multiple times to record the entire
> + * range of pfns.
> + */
> +static void tell_host_page_chunks(struct virtio_balloon *vb,
> +				  struct list_head *pages,
> +				  struct virtqueue *vq,
> +				  unsigned long pfn_max,
> +				  unsigned long pfn_min)
> +{
> +	/*
> +	 * The pfn_start and pfn_end form the range of pfns that the allocated
> +	 * page_bmap can record in each round.
> +	 */
> +	unsigned long pfn_start, pfn_end;
> +	/* Total number of allocated page_bmap */
> +	unsigned int page_bmap_num;
> +	struct page *page;
> +	bool found;
> +
> +	/*
> +	 * In the case that one page_bmap is not sufficient to record the pfn
> +	 * range, page_bmap will be extended by allocating more numbers of
> +	 * page_bmap.
> +	 */
> +	page_bmap_num = extend_page_bmap_size(vb, pfn_max - pfn_min + 1);
> +
> +	/* Start from the beginning of the whole pfn range */
> +	pfn_start = pfn_min;
> +	while (pfn_start < pfn_max) {
> +		pfn_end = pfn_start +
> +			  VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP * page_bmap_num;
> +		pfn_end = pfn_end < pfn_max ? pfn_end : pfn_max;
> +		clear_page_bmap(vb, page_bmap_num);
> +		found = false;
> +
> +		list_for_each_entry(page, pages, lru) {
> +			unsigned long bmap_idx, bmap_pos, this_pfn;
> +
> +			this_pfn = page_to_pfn(page);
> +			if (this_pfn < pfn_start || this_pfn > pfn_end)
> +				continue;
> +			bmap_idx = (this_pfn - pfn_start) /
> +				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +			bmap_pos = (this_pfn - pfn_start) %
> +				   VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP;
> +			set_bit(bmap_pos,
> +				vb->balloon_page_chunk.page_bmap[bmap_idx]);
> +
> +			found = true;
> +		}
> +		if (found)
> +			tell_host_from_page_bmap(vb, vq, pfn_start, pfn_end,
> +						 page_bmap_num);
> +		/*
> +		 * Start the next round when pfn_start and pfn_end couldn't
> +		 * cover the whole pfn range given by pfn_max and pfn_min.
> +		 */
> +		pfn_start = pfn_end;
> +	}
> +	free_extended_page_bmap(vb, page_bmap_num);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (!chunking)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -162,7 +425,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -171,8 +437,14 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			tell_host_page_chunks(vb, &vb_dev_info->pages,
> +					      vb->inflate_vq,
> +					      pfn_max, pfn_min);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -198,9 +470,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
>  
> -	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	/* Traditionally, we can only do one array worth at a time. */
> +	if (!chunking)
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	mutex_lock(&vb->balloon_lock);
>  	/* We can't release more pages than taken */
> @@ -210,7 +486,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		page = balloon_page_dequeue(vb_dev_info);
>  		if (!page)
>  			break;
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_pfn_range(vb, page, &pfn_min, &pfn_max);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -221,8 +500,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			tell_host_page_chunks(vb, &pages, vb->deflate_vq,
> +					      pfn_max, pfn_min);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -442,6 +726,14 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT,
> +		      VIRTIO_BALLOON_PAGES_PER_PAGE);
> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -465,6 +757,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_PAGE_CHUNKS);
>  	unsigned long flags;
>  
>  	/*
> @@ -486,16 +780,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -522,9 +822,78 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static void free_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
> +		kfree(vb->balloon_page_chunk.page_bmap[i]);
> +		vb->balloon_page_chunk.page_bmap[i] = NULL;
> +	}
> +}
> +
> +static int balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
> +						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
> +						GFP_KERNEL);

This one's problematic: you aren't supposed to use APIs when the device
is not inited yet. It seems to work by luck here. I suggest moving
this to probe; that's where we do a bunch of inits.
And then you can move the private init back to allocate too.
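
I.e. in virtballoon_probe(), after vb has been allocated, something like (sketch):

        if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
                err = balloon_page_chunk_init(vb);
                if (err)
                        goto out_free_vb;
        }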

> +	if (!vb->balloon_page_chunk.desc_table)
> +		goto err_page_chunk;
> +	vb->balloon_page_chunk.chunk_num = 0;
> +
> +	/*
> +	 * The default number of page_bmaps are allocated. More may be
> +	 * allocated on demand.
> +	 */
> +	for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM; i++) {
> +		vb->balloon_page_chunk.page_bmap[i] =
> +			    kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (!vb->balloon_page_chunk.page_bmap[i])
> +			goto err_page_bmap;
> +	}
> +
> +	return 0;
> +err_page_bmap:
> +	free_page_bmap(vb);
> +	kfree(vb->balloon_page_chunk.desc_table);
> +	vb->balloon_page_chunk.desc_table = NULL;
> +err_page_chunk:
> +	__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS);
> +	dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	return -ENOMEM;
> +}
> +
> +static int virtballoon_validate(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = NULL;
> +	int err;
> +
> +	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> +	if (!vb) {
> +		err = -ENOMEM;
> +		goto err_vb;
> +	}
> +	vb->vdev = vdev;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_CHUNKS)) {
> +		err = balloon_page_chunk_init(vb);
> +		if (err < 0)
> +			goto err_page_chunk;
> +	}
> +
> +	return 0;
> +
> +err_page_chunk:
> +	kfree(vb);
> +err_vb:
> +	return err;
> +}
> +

So here you are supposed to validate features, not handle OOM
conditions.  BTW we need a fix for vIOMMU - I noticed balloon does not
support that yet.
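
I.e. ->validate() would shrink to something like (sketch):

        static int virtballoon_validate(struct virtio_device *vdev)
        {
                /* feature sanity checks only; no allocations, no vdev->priv setup */
                return 0;
        }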

>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
> -	struct virtio_balloon *vb;
> +	struct virtio_balloon *vb = vdev->priv;
>  	int err;
>  
>  	if (!vdev->config->get) {
> @@ -533,20 +902,14 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		return -EINVAL;
>  	}
>  
> -	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> -	if (!vb) {
> -		err = -ENOMEM;
> -		goto out;
> -	}
> -
>  	INIT_WORK(&vb->update_balloon_stats_work, update_balloon_stats_func);
>  	INIT_WORK(&vb->update_balloon_size_work, update_balloon_size_func);
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
> -	vb->vdev = vdev;
>  
>  	balloon_devinfo_init(&vb->vb_dev_info);
>  
> @@ -590,7 +953,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	vdev->config->del_vqs(vdev);
>  out_free_vb:
>  	kfree(vb);
> -out:
>  	return err;
>  }
>  
> @@ -620,6 +982,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	free_page_bmap(vb);
> +	kfree(vb->balloon_page_chunk.desc_table);
>  #ifdef CONFIG_BALLOON_COMPACTION
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
> @@ -664,6 +1028,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_PAGE_CHUNKS,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> @@ -674,6 +1039,7 @@ static struct virtio_driver virtio_balloon_driver = {
>  	.id_table =	id_table,
>  	.probe =	virtballoon_probe,
>  	.remove =	virtballoon_remove,
> +	.validate =	virtballoon_validate,
>  	.config_changed = virtballoon_changed,
>  #ifdef CONFIG_PM_SLEEP
>  	.freeze	=	virtballoon_freeze,
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 409aeaa..0ea2512 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -235,8 +235,17 @@ static int vring_mapping_error(const struct vring_virtqueue *vq,
>  	return dma_mapping_error(vring_dma_dev(vq), addr);
>  }
>  
> -static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> -					 unsigned int total_sg, gfp_t gfp)
> +/**
> + * alloc_indirect - allocate an indirect desc table
> + * @vdev: the virtio_device that owns the indirect desc table.
> + * @num: the number of entries that the table will have.
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Return NULL if the table allocation failed. Otherwise, return the address
> + * of the table.
> + */
> +struct vring_desc *alloc_indirect(struct virtio_device *vdev, unsigned int num,
> +				  gfp_t gfp)
>  {
>  	struct vring_desc *desc;
>  	unsigned int i;
> @@ -248,14 +257,15 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
>  	 */
>  	gfp &= ~__GFP_HIGHMEM;
>  
> -	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> +	desc = kmalloc_array(num, sizeof(struct vring_desc), gfp);
>  	if (!desc)
>  		return NULL;
>  
> -	for (i = 0; i < total_sg; i++)
> -		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> +	for (i = 0; i < num; i++)
> +		desc[i].next = cpu_to_virtio16(vdev, i + 1);
>  	return desc;
>  }
> +EXPORT_SYMBOL_GPL(alloc_indirect);
>  
>  static inline int virtqueue_add(struct virtqueue *_vq,
>  				struct scatterlist *sgs[],
> @@ -302,7 +312,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
>  	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
> -		desc = alloc_indirect(_vq, total_sg, gfp);
> +		desc = alloc_indirect(_vq->vdev, total_sg, gfp);
>  	else
>  		desc = NULL;
>  
> @@ -433,6 +443,104 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  }
>  
>  /**
> + * virtqueue_indirect_desc_table_add - add an indirect desc table to the vq
> + * @_vq: the struct virtqueue we're talking about.
> + * @desc: the desc table we're talking about.
> + * @num: the number of entries that the desc table has.
> + *
> + * Returns zero or a negative error (ie. ENOSPC, EIO).
> + */
> +int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
> +				      struct vring_desc *desc,
> +				      unsigned int num)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	dma_addr_t desc_addr;
> +	unsigned int i, avail;
> +	int head;
> +
> +	/* Sanity check */
> +	if (!desc) {
> +		pr_debug("%s: empty desc table\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	START_USE(vq);
> +
> +	if (unlikely(vq->broken)) {
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	if (!vq->vq.num_free) {
> +		pr_debug("%s: the virtqueue is full\n", __func__);
> +		END_USE(vq);
> +		return -ENOSPC;
> +	}
> +
> +	/* Map and fill in the indirect table */
> +	desc_addr = vring_map_single(vq, desc, num * sizeof(struct vring_desc),
> +				     DMA_TO_DEVICE);
> +	if (vring_mapping_error(vq, desc_addr)) {
> +		pr_debug("%s: map desc failed\n", __func__);
> +		END_USE(vq);
> +		return -EIO;
> +	}
> +
> +	/* Mark the flag of the table entries */
> +	for (i = 0; i < num; i++)
> +		desc[i].flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT);
> +	/* The last one doesn't continue. */
> +	desc[num - 1].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> +
> +	/* Get a ring entry to point to the indirect table */
> +	head = vq->free_head;
> +	vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev,
> +						     VRING_DESC_F_INDIRECT);
> +	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, desc_addr);
> +	vq->vring.desc[head].len = cpu_to_virtio32(_vq->vdev, num *
> +						   sizeof(struct vring_desc));
> +	/* We're using 1 buffer from the free list. */
> +	vq->vq.num_free--;
> +	/* Update free pointer */
> +	vq->free_head = virtio16_to_cpu(_vq->vdev, vq->vring.desc[head].next);
> +
> +	/* Store token and indirect buffer state. */
> +	vq->desc_state[head].data = desc;
> +	/* Don't free the caller allocated indirect table when detach_buf. */
> +	vq->desc_state[head].indir_desc = NULL;
> +
> +	/*
> +	 * Put entry in available array (but don't update avail->idx until they
> +	 * do sync).
> +	 */
> +	avail = vq->avail_idx_shadow & (vq->vring.num - 1);
> +	vq->vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> +
> +	/*
> +	 * Descriptors and available array need to be set before we expose the
> +	 * new available array entries.
> +	 */
> +	virtio_wmb(vq->weak_barriers);
> +	vq->avail_idx_shadow++;
> +	vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> +	vq->num_added++;
> +
> +	pr_debug("%s: added buffer head %i to %p\n", __func__, head, vq);
> +	END_USE(vq);
> +
> +	/*
> +	 * This is very unlikely, but theoretically possible.  Kick
> +	 * just in case.
> +	 */
> +	if (unlikely(vq->num_added == (1 << 16) - 1))
> +		virtqueue_kick(_vq);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_indirect_desc_table_add);
> +

I'm not really happy with the fact we are duplicating so much code. Most
of this is duplicated from virtqueue_add, isn't it? I imagine you just
need to factor out the code from the following place down:

        /* If the host supports indirect descriptor tables, and we have multiple
         * buffers, then go indirect. FIXME: tune this threshold */
        if (vq->indirect && total_sg > 1 && vq->vq.num_free)
                desc = alloc_indirect(_vq, total_sg, gfp);
        else
                desc = NULL;

then reuse that.
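
E.g. factor the tail of virtqueue_add() into a shared core that takes an
already-filled indirect table; untested sketch, hypothetical name:

        /* body == what virtqueue_add() already does from the "map the
         * indirect table" step onward (head desc, free_head, avail ring) */
        static int __virtqueue_add_indirect(struct vring_virtqueue *vq,
                                            struct vring_desc *desc,
                                            unsigned int num, void *data);

        int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
                                              struct vring_desc *desc,
                                              unsigned int num)
        {
                return __virtqueue_add_indirect(to_vvq(_vq), desc, num, desc);
        }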

> +/**
>   * virtqueue_add_sgs - expose buffers to other end
>   * @vq: the struct virtqueue we're talking about.
>   * @sgs: array of terminated scatterlists.
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 7edfbdb..01dad22 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -34,6 +34,13 @@ struct virtqueue {
>  	void *priv;
>  };
>  
> +struct vring_desc *alloc_indirect(struct virtio_device *vdev,
> +				  unsigned int num, gfp_t gfp);
> +

Please prefix with virtqueue or virtio (depending on 1st parameter).
You also want a free API to pair with this (even though it's just kfree
right now).
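
E.g. (hypothetical names):

        struct vring_desc *virtio_alloc_indirect(struct virtio_device *vdev,
                                                 unsigned int num, gfp_t gfp);
        void virtio_free_indirect(struct vring_desc *desc);	/* just kfree() for now */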

> +int virtqueue_indirect_desc_table_add(struct virtqueue *_vq,
> +				      struct vring_desc *desc,
> +				      unsigned int num);
> +
>  int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 struct scatterlist sg[], unsigned int num,
>  			 void *data,
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..5ed3c7b 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_PAGE_CHUNKS	3 /* Inflate/Deflate pages in chunks */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> index c072959..0499fb8 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -111,6 +111,9 @@ struct vring {
>  #define VRING_USED_ALIGN_SIZE 4
>  #define VRING_DESC_ALIGN_SIZE 16
>  
> +/* The supported max queue size */
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
>  /* The standard layout for the ring is a continuous chunk of memory which looks
>   * like this.  We assume num is a power of 2.
>   *

Please do not add this to UAPI.
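
E.g. keep it driver-internal (sketch):

        /* in drivers/virtio/virtio_ring.c or the balloon driver, not in uapi */
        #define VIRTQUEUE_MAX_SIZE	1024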

> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-13 17:56     ` Michael S. Tsirkin
  (?)
@ 2017-06-13 17:59       ` Dave Hansen
  -1 siblings, 0 replies; 175+ messages in thread
From: Dave Hansen @ 2017-06-13 17:59 UTC (permalink / raw)
  To: Michael S. Tsirkin, Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, Matthew Wilcox

On 06/13/2017 10:56 AM, Michael S. Tsirkin wrote:
>> +/* The size of one page_bmap used to record inflated/deflated pages. */
>> +#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
> At this size, you probably want alloc_pages to avoid kmalloc
> overhead.

For slub, at least, kmalloc() just calls alloc_pages() basically
directly.  There's virtually no overhead.

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-13 17:59       ` Dave Hansen
  (?)
@ 2017-06-13 18:55         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-13 18:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, cornelia.huck, akpm, mgorman, aarcange,
	amit.shah, pbonzini, liliang.opensource, Matthew Wilcox

On Tue, Jun 13, 2017 at 10:59:07AM -0700, Dave Hansen wrote:
> On 06/13/2017 10:56 AM, Michael S. Tsirkin wrote:
> >> +/* The size of one page_bmap used to record inflated/deflated pages. */
> >> +#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
> > At this size, you probably want alloc_pages to avoid kmalloc
> > overhead.
> 
> For slub, at least, kmalloc() just calls alloc_pages() basically
> directly.  There's virtually no overhead.
> 
> 

OK then.

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [virtio-dev] Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
@ 2017-06-15  8:10       ` Wei Wang
  0 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-15  8:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Matthew Wilcox

On 06/14/2017 01:56 AM, Michael S. Tsirkin wrote:
> On Fri, Jun 09, 2017 at 06:41:38PM +0800, Wei Wang wrote:
>> Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
>> the transfer of the ballooned (i.e. inflated/deflated) pages in
>> chunks to the host.
> so now these chunks are just s/g list entry.
> So let's rename this VIRTIO_BALLOON_F_SG with a comment:
> * Use standard virtio s/g instead of PFN lists *

Actually, it's not using the standard s/g list in the implementation,
because using the standard s/g would need to kmalloc() the indirect
table on demand (i.e. when virtqueue_add() converts the s/g to an
indirect table).

The implementation instead directly pre-allocates an indirect desc table,
and uses an entry (i.e. a vring_desc) to describe a chunk. This
avoids the overhead of kmalloc()ing the indirect table.


>> +/*
>> + * Calculates how many pfns a page_bmap can record. A bit corresponds to a
>> + * page of PAGE_SIZE.
>> + */
>> +#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
>> +	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
>> +
>> +/* The number of page_bmap to allocate by default. */
>> +#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1
> It's not by default, it's at probe time, right?
It is the number of page bitmaps kept throughout the whole
lifecycle of the driver. The page bmap will be extended temporarily
when it is insufficient during a ballooning process, but when that
ballooning finishes, the extended part will be freed.
>> +/* The maximum number of page_bmap that can be allocated. */
> Not really, this is the size of the array we use to keep them.

This is the max number of page_bmaps that the array can be
temporarily extended to.

>> +#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
>> +
> So you still have a home-grown bitmap. I'd like to know why
> isn't xbitmap suggested for this purpose by Matthew Wilcox
> appropriate. Please add a comment explaining the requirements
> from the data structure.

I didn't find his xbitmap being upstreamed, did you?

>> +/*
>> + * QEMU virtio implementation requires the desc table size less than
>> + * VIRTQUEUE_MAX_SIZE, so minus 1 here.
> I think it doesn't, the issue is probably that you add a header
> as a separate s/g. In any case see below.
>
>> + */
>> +#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)
> This is wrong, virtio spec says s/g size should not exceed VQ size.
> If you want to support huge VQ sizes, you can add a fallback to
> smaller sizes until it fits in 1 page.

Probably no need for huge VQ size, 1024 queue size should be
enough. And we can have 1024 descriptors in the indirect
table, so the above size doesn't exceed the vq size, right?


> +static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
> +					  unsigned long pfn_num)
> what's this API doing?  Pls add comments. this seems to assume
> it will only be called once.
OK, I will add some comments here. This is the function to extend
the number of page bitmap when the original 1 page bmap is
not sufficient during a ballooning process. As mentioned above,
at the end of this ballooning process, the extended part will be freed.

> it would be better to avoid making
> this assumption, just look at what has been allocated
> and extend it.
Actually it's not an assumption. The rule here is that we always keep
"1" page bmap. "1" is defined by the
VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM. So when freeing, it also
references VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM (not assuming
any number)

> +}
> +
> +/* Add a chunk to the buffer. */
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  u64 base_addr, u32 size)
> +{
> +	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
> +	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
> +
> +	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
> +	desc->len = cpu_to_virtio32(vb->vdev, size);
> +	*num += 1;
> +	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq);
> +}
> +
> Poking at virtio internals like this is not nice. Pls move to virtio
> code.  Also, pages must be read descriptors as host might modify them.
>
> This also lacks viommu support but this is not mandatory as
> that is borken atm anyway. I'll send a patch to at least fail cleanly.
OK, thanks.

>> +static void convert_bmap_to_chunks(struct virtio_balloon *vb,
>> +				   struct virtqueue *vq,
>> +				   unsigned long *bmap,
>> +				   unsigned long pfn_start,
>> +				   unsigned long size)
>> +{
>> +	unsigned long next_one, next_zero, pos = 0;
>> +	u64 chunk_base_addr;
>> +	u32 chunk_size;
>> +
>> +	while (pos < size) {
>> +		next_one = find_next_bit(bmap, size, pos);
>> +		/*
>> +		 * No "1" bit found, which means that there is no pfn
>> +		 * recorded in the rest of this bmap.
>> +		 */
>> +		if (next_one == size)
>> +			break;
>> +		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
>> +		/*
>> +		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
>> +		 * Convert it to be pages of 4KB balloon page size when
>> +		 * adding it to a chunk.
> This looks wrong. add_one_chunk assumes size in bytes. So should be just
> PAGE_SIZE.

It's intended to be "chunk size", which is the number of pfns. The 
benefit is
that the 32-bit desc->len won't be overflow, as you mentioned below.


>
>> +		 */
>> +		chunk_size = (next_zero - next_one) *
>> +			     VIRTIO_BALLOON_PAGES_PER_PAGE;
> How do you know this won't overflow a 32 bit integer? Needs a comment.

If it stores size in bytes, it has the possibility to overflow.
If storing number of pfns, the 32-bit value can support 2^32*4KB=8TB
memory, unlikely to overflow.
> +
> +static int balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
> +						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
> +						GFP_KERNEL);
> This one's problematic, you aren't supposed to use APIs when device
> is not inited yet. Seems to work by luck here. I suggest moving
> this to probe, that's where we do a bunch of inits.
> And then you can move private init back to allocate too.

This is just to allocate an indirect desc table. If allocation fails, we 
need to clear
the related feature bit in ->validate(), right?

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-13 17:56     ` Michael S. Tsirkin
                       ` (3 preceding siblings ...)
  (?)
@ 2017-06-15  8:10     ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-15  8:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, qemu-devel, amit.shah,
	liliang.opensource, dave.hansen, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, cornelia.huck, pbonzini, akpm, mgorman

On 06/14/2017 01:56 AM, Michael S. Tsirkin wrote:
> On Fri, Jun 09, 2017 at 06:41:38PM +0800, Wei Wang wrote:
>> Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
>> the transfer of the ballooned (i.e. inflated/deflated) pages in
>> chunks to the host.
> so now these chunks are just s/g list entry.
> So let's rename this VIRTIO_BALLOON_F_SG with a comment:
> * Use standard virtio s/g instead of PFN lists *

Actually, it's not using the standard s/g list in the implementation,
because:
using the standard s/g will need kmalloc() the indirect table on
demand (i.e. when virtqueue_add() converts s/g to indirect table);

The implementation directly pre-allocates an indirect desc table,
and uses a entry (i.e. vring_desc) to describe a chunk. This
avoids the overhead of kmalloc() the indirect table.


>> +/*
>> + * Callulates how many pfns can a page_bmap record. A bit corresponds to a
>> + * page of PAGE_SIZE.
>> + */
>> +#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
>> +	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
>> +
>> +/* The number of page_bmap to allocate by default. */
>> +#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1
> It's not by default, it's at probe time, right?
It is the number of page bitmap being kept throughout the whole
lifecycle of the driver. The page bmap will be temporarily extended
due to insufficiency during a ballooning process, but when that
ballooning finishes, the extended part will be freed.
>> +/* The maximum number of page_bmap that can be allocated. */
> Not really, this is the size of the array we use to keep them.

This is the max number of the page bmap that can be
extended temporarily.

>> +#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
>> +
> So you still have a home-grown bitmap. I'd like to know why
> isn't xbitmap suggested for this purpose by Matthew Wilcox
> appropriate. Please add a comment explaining the requirements
> from the data structure.

I didn't find his xbitmap being upstreamed, did you?

>> +/*
>> + * QEMU virtio implementation requires the desc table size less than
>> + * VIRTQUEUE_MAX_SIZE, so minus 1 here.
> I think it doesn't, the issue is probably that you add a header
> as a separate s/g. In any case see below.
>
>> + */
>> +#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)
> This is wrong, virtio spec says s/g size should not exceed VQ size.
> If you want to support huge VQ sizes, you can add a fallback to
> smaller sizes until it fits in 1 page.

Probably no need for huge VQ size, 1024 queue size should be
enough. And we can have 1024 descriptors in the indirect
table, so the above size doesn't exceed the vq size, right?


> +static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
> +					  unsigned long pfn_num)
> what's this API doing?  Pls add comments. this seems to assume
> it will only be called once.
OK, I will add some comments here. This is the function to extend
the number of page bitmap when the original 1 page bmap is
not sufficient during a ballooning process. As mentioned above,
at the end of this ballooning process, the extended part will be freed.

> it would be better to avoid making
> this assumption, just look at what has been allocated
> and extend it.
Actually it's not an assumption. The rule here is that we always keep
"1" page bmap. "1" is defined by the
VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM. So when freeing, it also
references VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM (not assuming
any number)

> +}
> +
> +/* Add a chunk to the buffer. */
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  u64 base_addr, u32 size)
> +{
> +	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
> +	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
> +
> +	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
> +	desc->len = cpu_to_virtio32(vb->vdev, size);
> +	*num += 1;
> +	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq);
> +}
> +
> Poking at virtio internals like this is not nice. Pls move to virtio
> code.  Also, pages must be read descriptors as host might modify them.
>
> This also lacks viommu support but this is not mandatory as
> that is borken atm anyway. I'll send a patch to at least fail cleanly.
OK, thanks.

>> +static void convert_bmap_to_chunks(struct virtio_balloon *vb,
>> +				   struct virtqueue *vq,
>> +				   unsigned long *bmap,
>> +				   unsigned long pfn_start,
>> +				   unsigned long size)
>> +{
>> +	unsigned long next_one, next_zero, pos = 0;
>> +	u64 chunk_base_addr;
>> +	u32 chunk_size;
>> +
>> +	while (pos < size) {
>> +		next_one = find_next_bit(bmap, size, pos);
>> +		/*
>> +		 * No "1" bit found, which means that there is no pfn
>> +		 * recorded in the rest of this bmap.
>> +		 */
>> +		if (next_one == size)
>> +			break;
>> +		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
>> +		/*
>> +		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
>> +		 * Convert it to be pages of 4KB balloon page size when
>> +		 * adding it to a chunk.
> This looks wrong. add_one_chunk assumes size in bytes. So should be just
> PAGE_SIZE.

It's intended to be "chunk size", which is the number of pfns. The 
benefit is
that the 32-bit desc->len won't be overflow, as you mentioned below.


>
>> +		 */
>> +		chunk_size = (next_zero - next_one) *
>> +			     VIRTIO_BALLOON_PAGES_PER_PAGE;
> How do you know this won't overflow a 32 bit integer? Needs a comment.

If it stores size in bytes, it has the possibility to overflow.
If storing number of pfns, the 32-bit value can support 2^32*4KB=8TB
memory, unlikely to overflow.
> +
> +static int balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
> +						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
> +						GFP_KERNEL);
> This one's problematic, you aren't supposed to use APIs when device
> is not inited yet. Seems to work by luck here. I suggest moving
> this to probe, that's where we do a bunch of inits.
> And then you can move private init back to allocate too.

This is just to allocate an indirect desc table. If allocation fails, we 
need to clear
the related feature bit in ->validate(), right?

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-15  8:10       ` Wei Wang
@ 2017-06-16  3:19         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-16  3:19 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Matthew Wilcox

On Thu, Jun 15, 2017 at 04:10:17PM +0800, Wei Wang wrote:
> On 06/14/2017 01:56 AM, Michael S. Tsirkin wrote:
> > On Fri, Jun 09, 2017 at 06:41:38PM +0800, Wei Wang wrote:
> > > Add a new feature, VIRTIO_BALLOON_F_PAGE_CHUNKS, which enables
> > > the transfer of the ballooned (i.e. inflated/deflated) pages in
> > > chunks to the host.
> > so now these chunks are just s/g list entry.
> > So let's rename this VIRTIO_BALLOON_F_SG with a comment:
> > * Use standard virtio s/g instead of PFN lists *
> 
> Actually, it's not using the standard s/g list in the implementation,
> because using the standard s/g would require kmalloc()ing the indirect
> table on demand (i.e. when virtqueue_add() converts the s/g list to an
> indirect table).
> 
> The implementation instead pre-allocates an indirect desc table,
> and uses an entry (i.e. a vring_desc) to describe a chunk. This
> avoids the overhead of kmalloc()ing the indirect table.

It's a separate API but the host/guest interface is standard.

> 
> > > +/*
> > > + * Calculates how many pfns a page_bmap can record. A bit corresponds to a
> > > + * page of PAGE_SIZE.
> > > + */
> > > +#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
> > > +	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > > +
> > > +/* The number of page_bmap to allocate by default. */
> > > +#define VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM	1
> > It's not by default, it's at probe time, right?
> It is the number of page bitmaps kept throughout the whole
> lifecycle of the driver. The page bmap will be temporarily extended
> when it is insufficient during a ballooning process, but when that
> ballooning finishes, the extended part will be freed.
> > > +/* The maximum number of page_bmap that can be allocated. */
> > Not really, this is the size of the array we use to keep them.
> 
> This is the max number of the page bmap that can be
> extended temporarily.

That's just a confusing way of saying the same thing.

> > > +#define VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM	32
> > > +
> > So you still have a home-grown bitmap. I'd like to know why
> > isn't xbitmap suggested for this purpose by Matthew Wilcox
> > appropriate. Please add a comment explaining the requirements
> > from the data structure.
> 
> I didn't find his xbitmap being upstreamed, did you?

It's from dax tree - Matthew?

> > > +/*
> > > + * QEMU virtio implementation requires the desc table size less than
> > > + * VIRTQUEUE_MAX_SIZE, so minus 1 here.
> > I think it doesn't, the issue is probably that you add a header
> > as a separate s/g. In any case see below.
> > 
> > > + */
> > > +#define VIRTIO_BALLOON_MAX_PAGE_CHUNKS (VIRTQUEUE_MAX_SIZE - 1)
> > This is wrong, virtio spec says s/g size should not exceed VQ size.
> > If you want to support huge VQ sizes, you can add a fallback to
> > smaller sizes until it fits in 1 page.
> 
> Probably no need for huge VQ size, 1024 queue size should be
> enough. And we can have 1024 descriptors in the indirect
> table, so the above size doesn't exceed the vq size, right?

You need to look at vq size, you shouldn't assume it's > 1024.
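
For instance, a minimal sketch of capping the pre-allocated table to the
actual ring size (virtqueue_get_vring_size() is the existing helper for
querying it; the clamping itself is only a sketch of what the driver
could do):

	/* Never use more chunk descriptors than the ring can hold. */
	unsigned int max_chunks = min_t(unsigned int,
					VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
					virtqueue_get_vring_size(vq));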


> 
> > +static unsigned int extend_page_bmap_size(struct virtio_balloon *vb,
> > +					  unsigned long pfn_num)
> > what's this API doing?  Pls add comments. this seems to assume
> > it will only be called once.
> OK, I will add some comments here. This is the function to extend
> the number of page bitmaps when the original 1 page bmap is
> not sufficient during a ballooning process. As mentioned above,
> at the end of this ballooning process, the extended part will be freed.
> 
> > it would be better to avoid making
> > this assumption, just look at what has been allocated
> > and extend it.
> Actually it's not an assumption. The rule here is that we always keep
> "1" page bmap, where "1" is defined by
> VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM. So when freeing, it also
> references VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM (not assuming
> any number).

When allocating, why don't you check what has already been allocated?
Why assume VIRTIO_BALLOON_PAGE_BMAP_DEFAULT_NUM was allocated?
Then calling extend_page_bmap_size many times would be idempotent.
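
A rough sketch of that idea, assuming the vb->page_bmap[] array used by this
patch (the helper below is hypothetical, not part of the series):

	/* Extend based on what is actually allocated; safe to call repeatedly. */
	static void extend_page_bmap(struct virtio_balloon *vb, unsigned int needed)
	{
		unsigned int i;

		for (i = 0; i < VIRTIO_BALLOON_PAGE_BMAP_MAX_NUM && needed; i++) {
			if (vb->page_bmap[i])
				continue;	/* this slot already exists */
			vb->page_bmap[i] = kmalloc(VIRTIO_BALLOON_PAGE_BMAP_SIZE,
						   GFP_KERNEL);
			if (!vb->page_bmap[i])
				break;		/* best effort, caller copes */
			needed--;
		}
	}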

> > +}
> > +
> > +/* Add a chunk to the buffer. */
> > +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> > +			  u64 base_addr, u32 size)
> > +{
> > +	unsigned int *num = &vb->balloon_page_chunk.chunk_num;
> > +	struct vring_desc *desc = &vb->balloon_page_chunk.desc_table[*num];
> > +
> > +	desc->addr = cpu_to_virtio64(vb->vdev, base_addr);
> > +	desc->len = cpu_to_virtio32(vb->vdev, size);
> > +	*num += 1;
> > +	if (*num == VIRTIO_BALLOON_MAX_PAGE_CHUNKS)
> > +		send_page_chunks(vb, vq);
> > +}
> > +
> > Poking at virtio internals like this is not nice. Pls move to virtio
> > code.  Also, pages must be read descriptors as host might modify them.
> > 
> > This also lacks viommu support but this is not mandatory as
> > that is borken atm anyway. I'll send a patch to at least fail cleanly.
> OK, thanks.
> 
> > > +static void convert_bmap_to_chunks(struct virtio_balloon *vb,
> > > +				   struct virtqueue *vq,
> > > +				   unsigned long *bmap,
> > > +				   unsigned long pfn_start,
> > > +				   unsigned long size)
> > > +{
> > > +	unsigned long next_one, next_zero, pos = 0;
> > > +	u64 chunk_base_addr;
> > > +	u32 chunk_size;
> > > +
> > > +	while (pos < size) {
> > > +		next_one = find_next_bit(bmap, size, pos);
> > > +		/*
> > > +		 * No "1" bit found, which means that there is no pfn
> > > +		 * recorded in the rest of this bmap.
> > > +		 */
> > > +		if (next_one == size)
> > > +			break;
> > > +		next_zero = find_next_zero_bit(bmap, size, next_one + 1);
> > > +		/*
> > > +		 * A bit in page_bmap corresponds to a page of PAGE_SIZE.
> > > +		 * Convert it to be pages of 4KB balloon page size when
> > > +		 * adding it to a chunk.
> > This looks wrong. add_one_chunk assumes size in bytes. So should be just
> > PAGE_SIZE.
> 
> It's intended to be the "chunk size", which is the number of pfns. The
> benefit is that the 32-bit desc->len won't overflow, as you mentioned below.
> 

You can safely assume PAGE_SIZE >= 4K. Just pass # of pages.

> > 
> > > +		 */
> > > +		chunk_size = (next_zero - next_one) *
> > > +			     VIRTIO_BALLOON_PAGES_PER_PAGE;
> > How do you know this won't overflow a 32 bit integer? Needs a comment.
> 
> If it stores the size in bytes, it has the possibility to overflow.
> If it stores the number of pfns, the 32-bit value can cover
> 2^32 * 4KB = 16TB of memory, so it is unlikely to overflow.

So you put the len in the descriptor in units of 4K pages then? That
needs some thought. Also, processors support up to 256TB now, so
I don't think we can just assume it won't overflow anymore.
All in all, I'd prefer we just split everything up into 2G chunks.
We can discuss extending len to more bits or specifying alignment
in the descriptor separately down the road.
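
A minimal sketch of that splitting, reusing add_one_chunk() from this patch
(the 2G constant and the byte-based length here are assumptions for
illustration only):

	#define BALLOON_CHUNK_MAX_BYTES	(1ULL << 31)	/* 2G per descriptor */

	while (nr_bytes) {
		u32 len = min_t(u64, nr_bytes, BALLOON_CHUNK_MAX_BYTES);

		add_one_chunk(vb, vq, addr, len);
		addr += len;
		nr_bytes -= len;
	}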

> > +
> > +static int balloon_page_chunk_init(struct virtio_balloon *vb)
> > +{
> > +	int i;
> > +
> > +	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
> > +						VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
> > +						GFP_KERNEL);
> > This one's problematic, you aren't supposed to use APIs when device
> > is not inited yet. Seems to work by luck here. I suggest moving
> > this to probe, that's where we do a bunch of inits.
> > And then you can move private init back to allocate too.
> 
> This is just to allocate an indirect desc table. If allocation fails, we
> need to clear the related feature bit in ->validate(), right?

Failing probe on OOM is ok too.
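
For reference, a hedged sketch of that simpler option (the error label and
surrounding probe code are assumptions; only the fail-probe-on-ENOMEM idea
is the point):

	vb->balloon_page_chunk.desc_table = alloc_indirect(vb->vdev,
					VIRTIO_BALLOON_MAX_PAGE_CHUNKS,
					GFP_KERNEL);
	if (!vb->balloon_page_chunk.desc_table) {
		err = -ENOMEM;
		goto out_free_vb;	/* simply fail probe on OOM */
	}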

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-09 10:41   ` Wei Wang
@ 2017-06-20 16:18     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 16:18 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Jun 09, 2017 at 06:41:41PM +0800, Wei Wang wrote:
> -	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
> +	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
>  		virtqueue_kick(vq);
> -		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> -		vb->balloon_page_chunk.chunk_num = 0;
> +		if (busy_wait)
> +			while (!virtqueue_get_buf(vq, &len) &&
> +			       !virtqueue_is_broken(vq))
> +				cpu_relax();
> +		else
> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));


This is something I didn't previously notice.
As you always keep a single buffer in flight, you do not
really need indirect at all. Just add all descriptors
in the ring directly, then kick.

E.g.
	virtqueue_add_first
	virtqueue_add_next
	virtqueue_add_last

?

You also want a flag to avoid allocations, but there's no need to do it
per descriptor; set it on the vq.
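
A rough sketch of the suggested usage (virtqueue_add_first/next/last are the
hypothetical helpers proposed above, not existing virtio APIs, and the
signatures are guesses for illustration only):

	struct scatterlist sg;

	sg_init_one(&sg, first_addr, first_len);
	virtqueue_add_first(vq, &sg, vb);	/* opens the descriptor chain */

	sg_init_one(&sg, next_addr, next_len);
	virtqueue_add_next(vq, &sg, vb);	/* middle descriptors */

	sg_init_one(&sg, last_addr, last_len);
	virtqueue_add_last(vq, &sg, vb);	/* closes the chain and exposes it */

	virtqueue_kick(vq);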

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-12 14:10     ` Dave Hansen
  (?)
@ 2017-06-20 16:44       ` Rik van Riel
  -1 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 16:44 UTC (permalink / raw)
  To: Dave Hansen, Wei Wang, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, david, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

[-- Attachment #1: Type: text/plain, Size: 867 bytes --]

On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:

> The hypervisor is going to throw away the contents of these pages,
> right?  As soon as the spinlock is released, someone can allocate a
> page, and put good data in it.  What keeps the hypervisor from
> throwing
> away good data?

That looks like it may be the wrong API, then?

We already have hooks called arch_free_page and
arch_alloc_page in the VM, which are called when
pages are freed, and allocated, respectively.

Nitesh Lal (on the CC list) is working on a way
to efficiently batch recently freed pages for
free page hinting to the hypervisor.

If that is done efficiently enough (eg. with
MADV_FREE on the hypervisor side for lazy freeing,
and lazy later re-use of the pages), do we still
need the harder to use batch interface from this
patch?
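
For reference, the hypervisor-side half of that is just a lazy madvise
over the hinted range. A minimal sketch, assuming the guest range has
already been translated to a host-virtual address (the helper name is
illustrative, not an existing QEMU function):

#include <sys/mman.h>
#include <stdio.h>

/*
 * Mark a hinted free range as lazily reclaimable. MADV_FREE lets the
 * host kernel drop the pages only under memory pressure; if the pages
 * are written to again before reclaim, the data is simply kept.
 */
static void hint_free_range(void *hva, size_t len)
{
        if (madvise(hva, len, MADV_FREE) != 0)
                perror("madvise(MADV_FREE)");
}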

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 16:44       ` Rik van Riel
  0 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 16:44 UTC (permalink / raw)
  To: Dave Hansen, Wei Wang, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, david, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

[-- Attachment #1: Type: text/plain, Size: 867 bytes --]

On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:

> The hypervisor is going to throw away the contents of these pages,
> right?  As soon as the spinlock is released, someone can allocate a
> page, and put good data in it.  What keeps the hypervisor from
> throwing
> away good data?

That looks like it may be the wrong API, then?

We already have hooks called arch_free_page and
arch_alloc_page in the VM, which are called when
pages are freed, and allocated, respectively.

Nitesh Lal (on the CC list) is working on a way
to efficiently batch recently freed pages for
free page hinting to the hypervisor.

If that is done efficiently enough (eg. with
MADV_FREE on the hypervisor side for lazy freeing,
and lazy later re-use of the pages), do we still
need the harder to use batch interface from this
patch?

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 16:44       ` Rik van Riel
  0 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 16:44 UTC (permalink / raw)
  To: Dave Hansen, Wei Wang, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, david, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

[-- Attachment #1: Type: text/plain, Size: 867 bytes --]

On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:

> The hypervisor is going to throw away the contents of these pages,
> right?  As soon as the spinlock is released, someone can allocate a
> page, and put good data in it.  What keeps the hypervisor from
> throwing
> away good data?

That looks like it may be the wrong API, then?

We already have hooks called arch_free_page and
arch_alloc_page in the VM, which are called when
pages are freed, and allocated, respectively.

Nitesh Lal (on the CC list) is working on a way
to efficiently batch recently freed pages for
free page hinting to the hypervisor.

If that is done efficiently enough (eg. with
MADV_FREE on the hypervisor side for lazy freeing,
and lazy later re-use of the pages), do we still
need the harder to use batch interface from this
patch?

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-12 14:10     ` Dave Hansen
                       ` (4 preceding siblings ...)
  (?)
@ 2017-06-20 16:44     ` Rik van Riel
  -1 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 16:44 UTC (permalink / raw)
  To: Dave Hansen, Wei Wang, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, mst, david, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal


[-- Attachment #1.1: Type: text/plain, Size: 867 bytes --]

On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:

> The hypervisor is going to throw away the contents of these pages,
> right?  As soon as the spinlock is released, someone can allocate a
> page, and put good data in it.  What keeps the hypervisor from
> throwing
> away good data?

That looks like it may be the wrong API, then?

We already have hooks called arch_free_page and
arch_alloc_page in the VM, which are called when
pages are freed, and allocated, respectively.

Nitesh Lal (on the CC list) is working on a way
to efficiently batch recently freed pages for
free page hinting to the hypervisor.

If that is done efficiently enough (eg. with
MADV_FREE on the hypervisor side for lazy freeing,
and lazy later re-use of the pages), do we still
need the harder to use batch interface from this
patch?

-- 
All rights reversed

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]


^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 16:44       ` Rik van Riel
  (?)
@ 2017-06-20 16:49         ` David Hildenbrand
  -1 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 16:49 UTC (permalink / raw)
  To: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

On 20.06.2017 18:44, Rik van Riel wrote:
> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> 
>> The hypervisor is going to throw away the contents of these pages,
>> right?  As soon as the spinlock is released, someone can allocate a
>> page, and put good data in it.  What keeps the hypervisor from
>> throwing
>> away good data?
> 
> That looks like it may be the wrong API, then?
> 
> We already have hooks called arch_free_page and
> arch_alloc_page in the VM, which are called when
> pages are freed, and allocated, respectively.
> 
> Nitesh Lal (on the CC list) is working on a way
> to efficiently batch recently freed pages for
> free page hinting to the hypervisor.
> 
> If that is done efficiently enough (eg. with
> MADV_FREE on the hypervisor side for lazy freeing,
> and lazy later re-use of the pages), do we still
> need the harder to use batch interface from this
> patch?
> 
David's opinion incoming:

No, I think proper free page hinting would be the optimum solution, if
done right. This would avoid the batch interface and even turn
virtio-balloon in some sense useless.

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 16:49         ` David Hildenbrand
  0 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 16:49 UTC (permalink / raw)
  To: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

On 20.06.2017 18:44, Rik van Riel wrote:
> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> 
>> The hypervisor is going to throw away the contents of these pages,
>> right?  As soon as the spinlock is released, someone can allocate a
>> page, and put good data in it.  What keeps the hypervisor from
>> throwing
>> away good data?
> 
> That looks like it may be the wrong API, then?
> 
> We already have hooks called arch_free_page and
> arch_alloc_page in the VM, which are called when
> pages are freed, and allocated, respectively.
> 
> Nitesh Lal (on the CC list) is working on a way
> to efficiently batch recently freed pages for
> free page hinting to the hypervisor.
> 
> If that is done efficiently enough (eg. with
> MADV_FREE on the hypervisor side for lazy freeing,
> and lazy later re-use of the pages), do we still
> need the harder to use batch interface from this
> patch?
> 
David's opinion incoming:

No, I think proper free page hinting would be the optimum solution, if
done right. This would avoid the batch interface and even turn
virtio-balloon in some sense useless.

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 16:49         ` David Hildenbrand
  0 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 16:49 UTC (permalink / raw)
  To: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

On 20.06.2017 18:44, Rik van Riel wrote:
> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> 
>> The hypervisor is going to throw away the contents of these pages,
>> right?  As soon as the spinlock is released, someone can allocate a
>> page, and put good data in it.  What keeps the hypervisor from
>> throwing
>> away good data?
> 
> That looks like it may be the wrong API, then?
> 
> We already have hooks called arch_free_page and
> arch_alloc_page in the VM, which are called when
> pages are freed, and allocated, respectively.
> 
> Nitesh Lal (on the CC list) is working on a way
> to efficiently batch recently freed pages for
> free page hinting to the hypervisor.
> 
> If that is done efficiently enough (eg. with
> MADV_FREE on the hypervisor side for lazy freeing,
> and lazy later re-use of the pages), do we still
> need the harder to use batch interface from this
> patch?
> 
David's opinion incoming:

No, I think proper free page hinting would be the optimum solution, if
done right. This would avoid the batch interface and even turn
virtio-balloon in some sense useless.

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 16:44       ` Rik van Riel
  (?)
  (?)
@ 2017-06-20 16:49       ` David Hildenbrand
  -1 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 16:49 UTC (permalink / raw)
  To: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

On 20.06.2017 18:44, Rik van Riel wrote:
> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> 
>> The hypervisor is going to throw away the contents of these pages,
>> right?  As soon as the spinlock is released, someone can allocate a
>> page, and put good data in it.  What keeps the hypervisor from
>> throwing
>> away good data?
> 
> That looks like it may be the wrong API, then?
> 
> We already have hooks called arch_free_page and
> arch_alloc_page in the VM, which are called when
> pages are freed, and allocated, respectively.
> 
> Nitesh Lal (on the CC list) is working on a way
> to efficiently batch recently freed pages for
> free page hinting to the hypervisor.
> 
> If that is done efficiently enough (eg. with
> MADV_FREE on the hypervisor side for lazy freeing,
> and lazy later re-use of the pages), do we still
> need the harder to use batch interface from this
> patch?
> 
David's opinion incoming:

No, I think proper free page hinting would be the optimum solution, if
done right. This would avoid the batch interface and even turn
virtio-balloon in some sense useless.

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 16:49         ` David Hildenbrand
  (?)
@ 2017-06-20 17:29           ` Rik van Riel
  -1 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 17:29 UTC (permalink / raw)
  To: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, mst, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

[-- Attachment #1: Type: text/plain, Size: 1527 bytes --]

On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:

> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution,
> if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree with that.  Let me go into some more detail of
what Nitesh is implementing:

1) In arch_free_page, the being-freed page is added
   to a per-cpu set of freed pages.
2) Once that set is full, arch_free_pages goes into a
   slow path, which:
   2a) Iterates over the set of freed pages, and
   2b) Checks whether they are still free, and
   2c) Adds the still free pages to a list that is
       to be passed to the hypervisor, to be MADV_FREEd.
   2d) Makes that hypercall.

Meanwhile all arch_alloc_pages has to do is make sure it
does not allocate a page while it is currently being
MADV_FREEd on the hypervisor side.

The code Wei is working on looks like it could be 
suitable for steps (2c) and (2d) above. Nitesh already
has code for steps 1 through 2b.
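
Very roughly, and only as a sketch of the scheme above (not Nitesh's
actual code; the batch size and hypervisor_hint_pages() are
placeholders, and locking/preemption details are glossed over):

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/page-flags.h>

#define FREE_HINT_BATCH 256

struct free_hint_set {
        unsigned long pfns[FREE_HINT_BATCH];
        unsigned int nr;
};

static DEFINE_PER_CPU(struct free_hint_set, free_hints);

/* Placeholder for the hypercall / virtio request of step 2d. */
void hypervisor_hint_pages(unsigned long *pfns, unsigned int nr);

/* Steps 2a-2c: drop pages that were re-allocated meanwhile, keep the
 * ones still sitting on a free list, then pass those on. */
static void free_hint_flush(struct free_hint_set *set)
{
        unsigned int i, n = 0;

        for (i = 0; i < set->nr; i++)
                if (PageBuddy(pfn_to_page(set->pfns[i])))
                        set->pfns[n++] = set->pfns[i];

        if (n)
                hypervisor_hint_pages(set->pfns, n);    /* step 2d */
        set->nr = 0;
}

/* Step 1: batch the page being freed (order ignored for brevity);
 * step 2: take the slow path once the per-cpu set is full. */
void arch_free_page(struct page *page, int order)
{
        struct free_hint_set *set = this_cpu_ptr(&free_hints);

        set->pfns[set->nr++] = page_to_pfn(page);
        if (set->nr == FREE_HINT_BATCH)
                free_hint_flush(set);
}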

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 17:29           ` Rik van Riel
  0 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 17:29 UTC (permalink / raw)
  To: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, mst, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

[-- Attachment #1: Type: text/plain, Size: 1527 bytes --]

On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:

> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution,
> if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree with that.  Let me go into some more detail of
what Nitesh is implementing:

1) In arch_free_page, the being-freed page is added
   to a per-cpu set of freed pages.
2) Once that set is full, arch_free_pages goes into a
   slow path, which:
   2a) Iterates over the set of freed pages, and
   2b) Checks whether they are still free, and
   2c) Adds the still free pages to a list that is
       to be passed to the hypervisor, to be MADV_FREEd.
   2d) Makes that hypercall.

Meanwhile all arch_alloc_pages has to do is make sure it
does not allocate a page while it is currently being
MADV_FREEd on the hypervisor side.

The code Wei is working on looks like it could be 
suitable for steps (2c) and (2d) above. Nitesh already
has code for steps 1 through 2b.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 17:29           ` Rik van Riel
  0 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 17:29 UTC (permalink / raw)
  To: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, mst, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

[-- Attachment #1: Type: text/plain, Size: 1527 bytes --]

On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:

> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution,
> if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree with that.  Let me go into some more detail of
what Nitesh is implementing:

1) In arch_free_page, the being-freed page is added
   to a per-cpu set of freed pages.
2) Once that set is full, arch_free_pages goes into a
   slow path, which:
   2a) Iterates over the set of freed pages, and
   2b) Checks whether they are still free, and
   2c) Adds the still free pages to a list that is
       to be passed to the hypervisor, to be MADV_FREEd.
   2d) Makes that hypercall.

Meanwhile all arch_alloc_pages has to do is make sure it
does not allocate a page while it is currently being
MADV_FREEd on the hypervisor side.

The code Wei is working on looks like it could be 
suitable for steps (2c) and (2d) above. Nitesh already
has code for steps 1 through 2b.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 16:49         ` David Hildenbrand
  (?)
  (?)
@ 2017-06-20 17:29         ` Rik van Riel
  -1 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 17:29 UTC (permalink / raw)
  To: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, mst, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal


[-- Attachment #1.1: Type: text/plain, Size: 1527 bytes --]

On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:

> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution,
> if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree with that.  Let me go into some more detail of
what Nitesh is implementing:

1) In arch_free_page, the being-freed page is added
   to a per-cpu set of freed pages.
2) Once that set is full, arch_free_pages goes into a
   slow path, which:
   2a) Iterates over the set of freed pages, and
   2b) Checks whether they are still free, and
   2c) Adds the still free pages to a list that is
       to be passed to the hypervisor, to be MADV_FREEd.
   2d) Makes that hypercall.

Meanwhile all arch_alloc_pages has to do is make sure it
does not allocate a page while it is currently being
MADV_FREEd on the hypervisor side.

The code Wei is working on looks like it could be 
suitable for steps (2c) and (2d) above. Nitesh already
has code for steps 1 through 2b.

-- 
All rights reversed

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]


^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 16:49         ` David Hildenbrand
  (?)
@ 2017-06-20 18:17           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
> > On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> > 
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?  As soon as the spinlock is released, someone can allocate a
> >> page, and put good data in it.  What keeps the hypervisor from
> >> throwing
> >> away good data?
> > 
> > That looks like it may be the wrong API, then?
> > 
> > We already have hooks called arch_free_page and
> > arch_alloc_page in the VM, which are called when
> > pages are freed, and allocated, respectively.
> > 
> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution, if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree generally. But we have to balance that against the fact that
this was discussed since at least 2011 and no one built this solution
yet.

> -- 
> 
> Thanks,
> 
> David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 18:17           ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
> > On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> > 
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?  As soon as the spinlock is released, someone can allocate a
> >> page, and put good data in it.  What keeps the hypervisor from
> >> throwing
> >> away good data?
> > 
> > That looks like it may be the wrong API, then?
> > 
> > We already have hooks called arch_free_page and
> > arch_alloc_page in the VM, which are called when
> > pages are freed, and allocated, respectively.
> > 
> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution, if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree generally. But we have to balance that against the fact that
this was discussed since at least 2011 and no one built this solution
yet.

> -- 
> 
> Thanks,
> 
> David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 18:17           ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
> > On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> > 
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?  As soon as the spinlock is released, someone can allocate a
> >> page, and put good data in it.  What keeps the hypervisor from
> >> throwing
> >> away good data?
> > 
> > That looks like it may be the wrong API, then?
> > 
> > We already have hooks called arch_free_page and
> > arch_alloc_page in the VM, which are called when
> > pages are freed, and allocated, respectively.
> > 
> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution, if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree generally. But we have to balance that against the fact that
this was discussed since at least 2011 and no one built this solution
yet.

> -- 
> 
> Thanks,
> 
> David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 16:49         ` David Hildenbrand
                           ` (4 preceding siblings ...)
  (?)
@ 2017-06-20 18:17         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: aarcange, Rik van Riel, amit.shah, kvm, linux-mm, linux-kernel,
	liliang.opensource, qemu-devel, virtualization, Dave Hansen,
	cornelia.huck, pbonzini, akpm, Nitesh Narayan Lal, mgorman

On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
> > On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> > 
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?  As soon as the spinlock is released, someone can allocate a
> >> page, and put good data in it.  What keeps the hypervisor from
> >> throwing
> >> away good data?
> > 
> > That looks like it may be the wrong API, then?
> > 
> > We already have hooks called arch_free_page and
> > arch_alloc_page in the VM, which are called when
> > pages are freed, and allocated, respectively.
> > 
> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution, if
> done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree generally. But we have to balance that against the fact that
this was discussed since at least 2011 and no one built this solution
yet.

> -- 
> 
> Thanks,
> 
> David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 17:29           ` Rik van Riel
  (?)
@ 2017-06-20 18:26             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> > On 20.06.2017 18:44, Rik van Riel wrote:
> 
> > > Nitesh Lal (on the CC list) is working on a way
> > > to efficiently batch recently freed pages for
> > > free page hinting to the hypervisor.
> > > 
> > > If that is done efficiently enough (eg. with
> > > MADV_FREE on the hypervisor side for lazy freeing,
> > > and lazy later re-use of the pages), do we still
> > > need the harder to use batch interface from this
> > > patch?
> > > 
> > 
> > David's opinion incoming:
> > 
> > No, I think proper free page hinting would be the optimum solution,
> > if
> > done right. This would avoid the batch interface and even turn
> > virtio-balloon in some sense useless.
> 
> I agree with that.  Let me go into some more detail of
> what Nitesh is implementing:
> 
> 1) In arch_free_page, the being-freed page is added
>    to a per-cpu set of freed pages.
> 2) Once that set is full, arch_free_pages goes into a
>    slow path, which:
>    2a) Iterates over the set of freed pages, and
>    2b) Checks whether they are still free, and
>    2c) Adds the still free pages to a list that is
>        to be passed to the hypervisor, to be MADV_FREEd.
>    2d) Makes that hypercall.
> 
> Meanwhile all arch_alloc_pages has to do is make sure it
> does not allocate a page while it is currently being
> MADV_FREEd on the hypervisor side.
> 
> The code Wei is working on looks like it could be 
> suitable for steps (2c) and (2d) above. Nitesh already
> has code for steps 1 through 2b.
> 
> -- 
> All rights reversed


So my question is this: Wei posted these numbers for balloon
inflation times when inflating 7GB of an 8GB idle guest:

	1) allocating pages (6.5%)
	2) sending PFNs to host (68.3%)
	3) address translation (6.1%)
	4) madvise (19%)

	It takes about 4126ms for the inflating process to complete.

It seems that this is an excessive amount of time to stay
under a lock. What are your estimates for Nitesh's work?
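
(For scale, 68.3% of 4126 ms is roughly 2.8 seconds spent just sending
PFNs, and the madvise step accounts for roughly another 0.8 seconds.)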

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 18:26             ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> > On 20.06.2017 18:44, Rik van Riel wrote:
> 
> > > Nitesh Lal (on the CC list) is working on a way
> > > to efficiently batch recently freed pages for
> > > free page hinting to the hypervisor.
> > > 
> > > If that is done efficiently enough (eg. with
> > > MADV_FREE on the hypervisor side for lazy freeing,
> > > and lazy later re-use of the pages), do we still
> > > need the harder to use batch interface from this
> > > patch?
> > > 
> > 
> > David's opinion incoming:
> > 
> > No, I think proper free page hinting would be the optimum solution,
> > if
> > done right. This would avoid the batch interface and even turn
> > virtio-balloon in some sense useless.
> 
> I agree with that.  Let me go into some more detail of
> what Nitesh is implementing:
> 
> 1) In arch_free_page, the being-freed page is added
>    to a per-cpu set of freed pages.
> 2) Once that set is full, arch_free_pages goes into a
>    slow path, which:
>    2a) Iterates over the set of freed pages, and
>    2b) Checks whether they are still free, and
>    2c) Adds the still free pages to a list that is
>        to be passed to the hypervisor, to be MADV_FREEd.
>    2d) Makes that hypercall.
> 
> Meanwhile all arch_alloc_pages has to do is make sure it
> does not allocate a page while it is currently being
> MADV_FREEd on the hypervisor side.
> 
> The code Wei is working on looks like it could be 
> suitable for steps (2c) and (2d) above. Nitesh already
> has code for steps 1 through 2b.
> 
> -- 
> All rights reversed


So my question is this: Wei posted these numbers for balloon
inflation times when inflating 7GB of an 8GB idle guest:

	1) allocating pages (6.5%)
	2) sending PFNs to host (68.3%)
	3) address translation (6.1%)
	4) madvise (19%)

	It takes about 4126ms for the inflating process to complete.

It seems that this is an excessive amount of time to stay
under a lock. What are your estimates for Nitesh's work?

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 18:26             ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> > On 20.06.2017 18:44, Rik van Riel wrote:
> 
> > > Nitesh Lal (on the CC list) is working on a way
> > > to efficiently batch recently freed pages for
> > > free page hinting to the hypervisor.
> > > 
> > > If that is done efficiently enough (eg. with
> > > MADV_FREE on the hypervisor side for lazy freeing,
> > > and lazy later re-use of the pages), do we still
> > > need the harder to use batch interface from this
> > > patch?
> > > 
> > 
> > David's opinion incoming:
> > 
> > No, I think proper free page hinting would be the optimum solution,
> > if
> > done right. This would avoid the batch interface and even turn
> > virtio-balloon in some sense useless.
> 
> I agree with that.  Let me go into some more detail of
> what Nitesh is implementing:
> 
> 1) In arch_free_page, the being-freed page is added
>    to a per-cpu set of freed pages.
> 2) Once that set is full, arch_free_pages goes into a
>    slow path, which:
>    2a) Iterates over the set of freed pages, and
>    2b) Checks whether they are still free, and
>    2c) Adds the still free pages to a list that is
>        to be passed to the hypervisor, to be MADV_FREEd.
>    2d) Makes that hypercall.
> 
> Meanwhile all arch_alloc_pages has to do is make sure it
> does not allocate a page while it is currently being
> MADV_FREEd on the hypervisor side.
> 
> The code Wei is working on looks like it could be 
> suitable for steps (2c) and (2d) above. Nitesh already
> has code for steps 1 through 2b.
> 
> -- 
> All rights reversed


So my question is this: Wei posted these numbers for balloon
inflation times when inflating 7GB of an 8GB idle guest:

	1) allocating pages (6.5%)
	2) sending PFNs to host (68.3%)
	3) address translation (6.1%)
	4) madvise (19%)

	It takes about 4126ms for the inflating process to complete.

It seems that this is an excessive amount of time to stay
under a lock. What are your estimates for Nitesh's work?

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 17:29           ` Rik van Riel
  (?)
  (?)
@ 2017-06-20 18:26           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: aarcange, amit.shah, kvm, linux-kernel, liliang.opensource,
	qemu-devel, virtualization, linux-mm, Dave Hansen, cornelia.huck,
	pbonzini, akpm, Nitesh Narayan Lal, mgorman

On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> > On 20.06.2017 18:44, Rik van Riel wrote:
> 
> > > Nitesh Lal (on the CC list) is working on a way
> > > to efficiently batch recently freed pages for
> > > free page hinting to the hypervisor.
> > > 
> > > If that is done efficiently enough (eg. with
> > > MADV_FREE on the hypervisor side for lazy freeing,
> > > and lazy later re-use of the pages), do we still
> > > need the harder to use batch interface from this
> > > patch?
> > > 
> > 
> > David's opinion incoming:
> > 
> > No, I think proper free page hinting would be the optimum solution,
> > if
> > done right. This would avoid the batch interface and even turn
> > virtio-balloon in some sense useless.
> 
> I agree with that.  Let me go into some more detail of
> what Nitesh is implementing:
> 
> 1) In arch_free_page, the being-freed page is added
>    to a per-cpu set of freed pages.
> 2) Once that set is full, arch_free_pages goes into a
>    slow path, which:
>    2a) Iterates over the set of freed pages, and
>    2b) Checks whether they are still free, and
>    2c) Adds the still free pages to a list that is
>        to be passed to the hypervisor, to be MADV_FREEd.
>    2d) Makes that hypercall.
> 
> Meanwhile all arch_alloc_pages has to do is make sure it
> does not allocate a page while it is currently being
> MADV_FREEd on the hypervisor side.
> 
> The code Wei is working on looks like it could be 
> suitable for steps (2c) and (2d) above. Nitesh already
> has code for steps 1 through 2b.
> 
> -- 
> All rights reversed


So my question is this: Wei posted these numbers for balloon
inflation times when inflating 7GB of an 8GB idle guest:

	1) allocating pages (6.5%)
	2) sending PFNs to host (68.3%)
	3) address translation (6.1%)
	4) madvise (19%)

	It takes about 4126ms for the inflating process to complete.

It seems that this is an excessive amount of time to stay
under a lock. What are your estimates for Nitesh's work?

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 18:17           ` Michael S. Tsirkin
  (?)
@ 2017-06-20 18:54             ` David Hildenbrand
  -1 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 18:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?  As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it.  What keeps the hypervisor from
>>>> throwing
>>>> away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and
>>> arch_alloc_page in the VM, which are called when
>>> pages are freed, and allocated, respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution, if
>> done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.
> 
> I agree generally. But we have to balance that against the fact that
> this was discussed since at least 2011 and no one built this solution
> yet.

I totally agree, and I still think it will be hard to get a decent
performance for free page hinting (let's call it challenging). But I
heard of some interesting ideas. Surprise me.

Still, I would favor such an interface over a mm interface where people
start asking the same question over and over again ("how can this even
work"). Not only because it wasn't explained sufficiently enough, but
also because this interface is so special for one use case and one
scenario (concurrent dirty tracking in the host during migration).

IMHO even simply writing all-zeros to all free pages before starting
migration (or even when freeing a page) would be a cleaner interface
than this (because it atomically works with the entity the host cares
about for migration). But yes, performance is horrible that's why I am
not even suggesting it. Just saying that this mm interface is very very
special and if we could find something better, I'd favor it.
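
Purely for illustration, the zero-on-free variant would amount to
something like the sketch below (assuming an arch_free_page-style hook;
the full page write per free is exactly the performance cost meant
above):

#include <linux/mm.h>
#include <linux/highmem.h>

/*
 * Illustrative only: zero every page as it is freed, so that free
 * pages are trivially detectable as zero pages on the host side at
 * migration time.
 */
void arch_free_page(struct page *page, int order)
{
        int i;

        for (i = 0; i < (1 << order); i++)
                clear_highpage(page + i);
}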

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 18:54             ` David Hildenbrand
  0 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 18:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?  As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it.  What keeps the hypervisor from
>>>> throwing
>>>> away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and
>>> arch_alloc_page in the VM, which are called when
>>> pages are freed, and allocated, respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution, if
>> done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.
> 
> I agree generally. But we have to balance that against the fact that
> this was discussed since at least 2011 and no one built this solution
> yet.

I totally agree, and I still think it will be hard to get a decent
performance for free page hinting (let's call it challenging). But I
heard of some interesting ideas. Surprise me.

Still, I would favor such an interface over a mm interface where people
start asking the same question over and over again ("how can this even
work"). Not only because it wasn't explained sufficiently enough, but
also because this interface is so special for one use case and one
scenario (concurrent dirty tracking in the host during migration).

IMHO even simply writing all-zeros to all free pages before starting
migration (or even when freeing a page) would be a cleaner interface
than this (because it atomically works with the entity the host cares
about for migration). But yes, performance is horrible that's why I am
not even suggesting it. Just saying that this mm interface is very very
special and if we could find something better, I'd favor it.

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-20 18:54             ` David Hildenbrand
  0 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 18:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?  As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it.  What keeps the hypervisor from
>>>> throwing
>>>> away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and
>>> arch_alloc_page in the VM, which are called when
>>> pages are freed, and allocated, respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution, if
>> done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.
> 
> I agree generally. But we have to balance that against the fact that
> this was discussed since at least 2011 and no one built this solution
> yet.

I totally agree, and I still think it will be hard to get a decent
performance for free page hinting (let's call it challenging). But I
heard of some interesting ideas. Surprise me.

Still, I would favor such an interface over a mm interface where people
start asking the same question over and over again ("how can this even
work"). Not only because it wasn't explained sufficiently enough, but
also because this interface is so special for one use case and one
scenario (concurrent dirty tracking in the host during migration).

IMHO even simply writing all-zeros to all free pages before starting
migration (or even when freeing a page) would be a cleaner interface
than this (because it atomically works with the entity the host cares
about for migration). But yes, performance is horrible that's why I am
not even suggesting it. Just saying that this mm interface is very very
special and if we could find something better, I'd favor it.

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 18:17           ` Michael S. Tsirkin
                             ` (2 preceding siblings ...)
  (?)
@ 2017-06-20 18:54           ` David Hildenbrand
  -1 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 18:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, Rik van Riel, amit.shah, kvm, linux-mm, linux-kernel,
	liliang.opensource, qemu-devel, virtualization, Dave Hansen,
	cornelia.huck, pbonzini, akpm, Nitesh Narayan Lal, mgorman

On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?  As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it.  What keeps the hypervisor from
>>>> throwing
>>>> away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and
>>> arch_alloc_page in the VM, which are called when
>>> pages are freed, and allocated, respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution, if
>> done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.
> 
> I agree generally. But we have to balance that against the fact that
> this was discussed since at least 2011 and no one built this solution
> yet.

I totally agree, and I still think it will be hard to get a decent
performance for free page hinting (let's call it challenging). But I
heard of some interesting ideas. Surprise me.

Still, I would favor such an interface over a mm interface where people
start asking the same question over and over again ("how can this even
work"). Not only because it wasn't explained sufficiently enough, but
also because this interface is so special for one use case and one
scenario (concurrent dirty tracking in the host during migration).

IMHO even simply writing all-zeros to all free pages before starting
migration (or even when freeing a page) would be a cleaner interface
than this (because it atomically works with the entity the host cares
about for migration). But yes, performance is horrible that's why I am
not even suggesting it. Just saying that this mm interface is very very
special and if we could find something better, I'd favor it.

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 18:54             ` David Hildenbrand
  (?)
@ 2017-06-20 18:56               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-20 18:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 08:54:29PM +0200, David Hildenbrand wrote:
> On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> >> On 20.06.2017 18:44, Rik van Riel wrote:
> >>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> >>>
> >>>> The hypervisor is going to throw away the contents of these pages,
> >>>> right?  As soon as the spinlock is released, someone can allocate a
> >>>> page, and put good data in it.  What keeps the hypervisor from
> >>>> throwing
> >>>> away good data?
> >>>
> >>> That looks like it may be the wrong API, then?
> >>>
> >>> We already have hooks called arch_free_page and
> >>> arch_alloc_page in the VM, which are called when
> >>> pages are freed, and allocated, respectively.
> >>>
> >>> Nitesh Lal (on the CC list) is working on a way
> >>> to efficiently batch recently freed pages for
> >>> free page hinting to the hypervisor.
> >>>
> >>> If that is done efficiently enough (eg. with
> >>> MADV_FREE on the hypervisor side for lazy freeing,
> >>> and lazy later re-use of the pages), do we still
> >>> need the harder to use batch interface from this
> >>> patch?
> >>>
> >> David's opinion incoming:
> >>
> >> No, I think proper free page hinting would be the optimum solution, if
> >> done right. This would avoid the batch interface and even turn
> >> virtio-balloon in some sense useless.
> > 
> > I agree generally. But we have to balance that against the fact that
> > this was discussed since at least 2011 and no one built this solution
> > yet.
> 
> I totally agree, and I still think it will be hard to get a decent
> performance for free page hinting (let's call it challenging). But I
> heard of some interesting ideas. Surprise me.
> 
> Still, I would favor such an interface over a mm interface where people
> start asking the same question over and over again ("how can this even
> work"). Not only because it wasn't explained sufficiently enough, but
> also because this interface is so special for one use case and one
> scenario (concurrent dirty tracking in the host during migration).
> 
> IMHO even simply writing all-zeros to all free pages before starting
> migration (or even when freeing a page) would be a cleaner interface
> than this (because it atomically works with the entity the host cares
> about for migration). But yes, performance is horrible that's why I am
> not even suggesting it. Just saying that this mm interface is very very
> special and if we could find something better, I'd favor it.

As long as there's a single user, changing to a better interface
once it's found won't be hard at all :)

> -- 
> 
> Thanks,
> 
> David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 18:56               ` Michael S. Tsirkin
  (?)
@ 2017-06-20 19:01                 ` David Hildenbrand
  -1 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-20 19:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rik van Riel, Dave Hansen, Wei Wang, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal


>> IMHO even simply writing all-zeros to all free pages before starting
>> migration (or even when freeing a page) would be a cleaner interface
>> than this (because it atomically works with the entity the host cares
>> about for migration). But yes, performance is horrible that's why I am
>> not even suggesting it. Just saying that this mm interface is very very
>> special and if we could find something better, I'd favor it.
> 
> As long as there's a single user, changing to a better interface
> once it's found won't be hard at all :)
> 

Hehe, more like "we made this beautiful virtio-balloon extension" - oh
there is free page hinting (assuming that it does not reuse the batch
interface here). Guess how long it would take to at least show that free
page hinting can be done. If it takes another 6 years, I am totally on
your side ;)

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 18:26             ` Michael S. Tsirkin
  (?)
@ 2017-06-20 19:51               ` Rik van Riel
  -1 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2017-06-20 19:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > I agree with that.  Let me go into some more detail of
> > what Nitesh is implementing:
> > 
> > 1) In arch_free_page, the being-freed page is added
> >    to a per-cpu set of freed pages.
> > 2) Once that set is full, arch_free_pages goes into a
> >    slow path, which:
> >    2a) Iterates over the set of freed pages, and
> >    2b) Checks whether they are still free, and
> >    2c) Adds the still free pages to a list that is
> >        to be passed to the hypervisor, to be MADV_FREEd.
> >    2d) Makes that hypercall.
> > 
> > Meanwhile all arch_alloc_pages has to do is make sure it
> > does not allocate a page while it is currently being
> > MADV_FREEd on the hypervisor side.
> > 
> > The code Wei is working on looks like it could be 
> > suitable for steps (2c) and (2d) above. Nitesh already
> > has code for steps 1 through 2b.
> 
> So my question is this: Wei posted these numbers for balloon
> inflation times:
> inflating 7GB of an 8GB idle guest:
> 
> 	1) allocating pages (6.5%)
> 	2) sending PFNs to host (68.3%)
> 	3) address translation (6.1%)
> 	4) madvise (19%)
> 
> 	It takes about 4126ms for the inflating process to complete.
> 
> It seems that this is an excessive amount of time to stay
> under a lock. What are your estimates for Nitesh's work?

That depends on the batch size used for step
(2c), and is something that we should be able
to tune for decent performance.

What seems to matter is that things are batched.
There are many ways to achieve that.
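
To make the flow above a bit more concrete, here is a rough sketch of
such an arch_free_page batching hook. The batch size, the helper name
and the locking details are illustrative assumptions on my side, not
Nitesh's actual code:

#include <linux/mm.h>
#include <linux/percpu.h>

#define FREE_PAGE_BATCH	256

struct free_page_batch {
	unsigned long pfns[FREE_PAGE_BATCH];
	unsigned int count;
};

static DEFINE_PER_CPU(struct free_page_batch, free_batch);

/* step 1: record every freed page in a per-cpu set */
void arch_free_page(struct page *page, int order)
{
	struct free_page_batch *b = this_cpu_ptr(&free_batch);

	b->pfns[b->count++] = page_to_pfn(page);	/* order ignored for brevity */
	if (b->count < FREE_PAGE_BATCH)
		return;

	/*
	 * Slow path (step 2): recheck which recorded pages are still
	 * free (2a/2b), pass the survivors to the hypervisor so it can
	 * MADV_FREE them (2c/2d), then reset the batch.  Preemption and
	 * the interaction with arch_alloc_page are glossed over here.
	 */
	report_still_free_pages(b);	/* hypothetical helper */
	b->count = 0;
}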

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-20 16:18     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-06-21  3:28       ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-21  3:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 06/21/2017 12:18 AM, Michael S. Tsirkin wrote:
> On Fri, Jun 09, 2017 at 06:41:41PM +0800, Wei Wang wrote:
>> -	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
>> +	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
>>   		virtqueue_kick(vq);
>> -		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> -		vb->balloon_page_chunk.chunk_num = 0;
>> +		if (busy_wait)
>> +			while (!virtqueue_get_buf(vq, &len) &&
>> +			       !virtqueue_is_broken(vq))
>> +				cpu_relax();
>> +		else
>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>
> This is something I didn't previously notice.
> As you always keep a single buffer in flight, you do not
> really need indirect at all. Just add all descriptors
> in the ring directly, then kick.
>
> E.g.
> 	virtqueue_add_first
> 	virtqueue_add_next
> 	virtqueue_add_last
>
> ?
>
> You also want a flag to avoid allocations but there's no need to do it
> per descriptor, set it on vq.
>

Without using the indirect table, I'm thinking about switching to the
standard sg (i.e. struct scatterlist) instead of vring_desc, so that
we don't need to modify virtqueue_add() or add any new variants of it.

In this case, we would kmalloc an array of sgs in probe(), add the sgs
one by one to the vq (which won't trigger the allocation of an indirect
table inside virtqueue_add()), and then kick once all of them are added.
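
Roughly along these lines (just a sketch of the idea, not the actual
patch; the function name, vb->sgs and the chunk limit are made up here):

/*
 * Sketch only: vb->sgs[] would be allocated once in probe(), e.g.
 *	vb->sgs = kmalloc_array(VB_MAX_CHUNKS, sizeof(*vb->sgs), GFP_KERNEL);
 *	sg_init_table(vb->sgs, VB_MAX_CHUNKS);
 */
static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
			     unsigned int nchunks)
{
	unsigned int i;

	for (i = 0; i < nchunks; i++) {
		/*
		 * One sg entry per add: total_sg == 1, so virtqueue_add()
		 * never takes the indirect path and allocates nothing.
		 */
		if (virtqueue_add_outbuf(vq, &vb->sgs[i], 1, vb, GFP_KERNEL))
			break;	/* ring full; real code would kick and retry */
	}
	virtqueue_kick(vq);	/* a single kick for the whole batch */
	/* completion handling (virtqueue_get_buf/wait_event) omitted here */
}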

Best,
Wei

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 17:29           ` Rik van Riel
@ 2017-06-21  8:38             ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-21  8:38 UTC (permalink / raw)
  To: Rik van Riel, David Hildenbrand, Dave Hansen, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, mst, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource
  Cc: Nitesh Narayan Lal

On 06/21/2017 01:29 AM, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution,
>> if
>> done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.
> I agree with that.  Let me go into some more detail of
> what Nitesh is implementing:
>
> 1) In arch_free_page, the being-freed page is added
>     to a per-cpu set of freed pages.

I have some questions here:

1. Are the pages managed one by one in the per-CPU set?
For example, when there are 2 adjacent pages, are they still
put as two separate nodes on the per-CPU list, or will the buddy
algorithm be re-implemented on the per-CPU list as well?

2. It looks like this will be added to the common free function.
Normally, people may not need the free page hint; do they
still need to carry the added burden?


> 2) Once that set is full, arch_free_pages goes into a
>     slow path, which:
>     2a) Iterates over the set of freed pages, and
>     2b) Checks whether they are still free, and

The pages that have been double-checked as "free"
here and added to the list for the hypervisor can
also be allocated and used immediately afterwards.


>     2c) Adds the still free pages to a list that is
>         to be passed to the hypervisor, to be MADV_FREEd.
>     2d) Makes that hypercall.
>
> Meanwhile all arch_alloc_pages has to do is make sure it
> does not allocate a page while it is currently being
> MADV_FREEd on the hypervisor side.

Is this proposed to replace the balloon driver?

>
> The code Wei is working on looks like it could be
> suitable for steps (2c) and (2d) above. Nitesh already
> has code for steps 1 through 2b.
>

May I know the advantages of the added steps? Thanks.

Best,
Wei

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-21  3:28       ` Wei Wang
  (?)
  (?)
@ 2017-06-21 12:28         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-21 12:28 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource, riel, nilal

On Wed, Jun 21, 2017 at 11:28:00AM +0800, Wei Wang wrote:
> On 06/21/2017 12:18 AM, Michael S. Tsirkin wrote:
> > On Fri, Jun 09, 2017 at 06:41:41PM +0800, Wei Wang wrote:
> > > -	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
> > > +	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
> > >   		virtqueue_kick(vq);
> > > -		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > -		vb->balloon_page_chunk.chunk_num = 0;
> > > +		if (busy_wait)
> > > +			while (!virtqueue_get_buf(vq, &len) &&
> > > +			       !virtqueue_is_broken(vq))
> > > +				cpu_relax();
> > > +		else
> > > +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > 
> > This is something I didn't previously notice.
> > As you always keep a single buffer in flight, you do not
> > really need indirect at all. Just add all descriptors
> > in the ring directly, then kick.
> > 
> > E.g.
> > 	virtqueue_add_first
> > 	virtqueue_add_next
> > 	virtqueue_add_last
> > 
> > ?
> > 
> > You also want a flag to avoid allocations but there's no need to do it
> > per descriptor, set it on vq.
> > 
> 
> Without using the indirect table, I'm thinking about changing to use
> the standard sg (i.e. struct scatterlist), instead of vring_desc, so that
> we don't need to modify or add any new functions of virtqueue_add().
> 
> In this case, we will kmalloc an array of sgs in probe(), and we can add
> the sgs one by one to the vq, which won't trigger the allocation of an
> indirect table inside virtqueue_add(), and then kick when all are added.
> 
> Best,
> Wei

And allocate headers too? This can work. API extensions aren't
necessarily a bad idea, though. The API I suggested above is preferable
for the simple reason that it can work without INDIRECT flag
support in the hypervisor.

I wonder which APIs Nitesh would find useful.
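
Roughly, such calls might look like this (purely illustrative; none of
these helpers exist in the virtio core today, and the signatures are
only a guess):

	/* each call places one out-descriptor directly in the ring;
	 * _last closes the chain, so a single kick covers the batch */
	int virtqueue_add_first(struct virtqueue *vq, struct scatterlist *sg);
	int virtqueue_add_next(struct virtqueue *vq, struct scatterlist *sg);
	int virtqueue_add_last(struct virtqueue *vq, struct scatterlist *sg, void *data);

	/* usage: all descriptors go straight into the ring, one kick at the end */
	virtqueue_add_first(vq, &sgs[0]);
	for (i = 1; i < nents - 1; i++)
		virtqueue_add_next(vq, &sgs[i]);
	virtqueue_add_last(vq, &sgs[nents - 1], token);
	virtqueue_kick(vq);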

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 19:51               ` Rik van Riel
  (?)
  (?)
@ 2017-06-21 12:41                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-21 12:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 03:51:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > > I agree with that.  Let me go into some more detail of
> > > what Nitesh is implementing:
> > > 
> > > 1) In arch_free_page, the being-freed page is added
> > >    to a per-cpu set of freed pages.
> > > 2) Once that set is full, arch_free_pages goes into a
> > >    slow path, which:
> > >    2a) Iterates over the set of freed pages, and
> > >    2b) Checks whether they are still free, and
> > >    2c) Adds the still free pages to a list that is
> > >        to be passed to the hypervisor, to be MADV_FREEd.
> > >    2d) Makes that hypercall.
> > > 
> > > Meanwhile all arch_alloc_pages has to do is make sure it
> > > does not allocate a page while it is currently being
> > > MADV_FREEd on the hypervisor side.
> > > 
> > > The code Wei is working on looks like it could be 
> > > suitable for steps (2c) and (2d) above. Nitesh already
> > > has code for steps 1 through 2b.
> > 
> > So my question is this: Wei posted these numbers for balloon
> > inflation times:
> > inflating 7GB of an 8GB idle guest:
> > 
> > 	1) allocating pages (6.5%)
> > 	2) sending PFNs to host (68.3%)
> > 	3) address translation (6.1%)
> > 	4) madvise (19%)
> > 
> > 	It takes about 4126ms for the inflating process to complete.
> > 
> > It seems that this is an excessive amount of time to stay
> > under a lock. What are your estimates for Nitesh's work?
> 
> That depends on the batch size used for step
> (2c), and is something that we should be able
> to tune for decent performance.

I am not really sure how you intend to do this. Who will
drop and retake the lock? How do you make progress
instead of restarting from the beginning?
How do you combine multiple pages in a single s/g?

All of these are issues that Wei's patches solved,
granted in a very limited (migration-specific) manner,
but OTOH without a lot of tuning.


> What seems to matter is that things are batched.
> There are many ways to achieve that.

True, this is what the patches are trying to achieve. So far this
approach is the first more or less workable way to achieve that;
previous ones got us nowhere.


> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-21 12:41                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-21 12:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 03:51:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > > I agree with that.  Let me go into some more detail of
> > > what Nitesh is implementing:
> > > 
> > > 1) In arch_free_page, the being-freed page is added
> > >    to a per-cpu set of freed pages.
> > > 2) Once that set is full, arch_free_pages goes into a
> > >    slow path, which:
> > >    2a) Iterates over the set of freed pages, and
> > >    2b) Checks whether they are still free, and
> > >    2c) Adds the still free pages to a list that is
> > >        to be passed to the hypervisor, to be MADV_FREEd.
> > >    2d) Makes that hypercall.
> > > 
> > > Meanwhile all arch_alloc_pages has to do is make sure it
> > > does not allocate a page while it is currently being
> > > MADV_FREEd on the hypervisor side.
> > > 
> > > The code Wei is working on looks like it could be 
> > > suitable for steps (2c) and (2d) above. Nitesh already
> > > has code for steps 1 through 2b.
> > 
> > So my question is this: Wei posted these numbers for balloon
> > inflation times:
> > inflating 7GB of an 8GB idle guest:
> > 
> > 	1) allocating pages (6.5%)
> > 	2) sending PFNs to host (68.3%)
> > 	3) address translation (6.1%)
> > 	4) madvise (19%)
> > 
> > 	It takes about 4126ms for the inflating process to complete.
> > 
> > It seems that this is an excessive amount of time to stay
> > under a lock. What are your estimates for Nitesh's work?
> 
> That depends on the batch size used for step
> (2c), and is something that we should be able
> to tune for decent performance.

I am not really sure how you intend to do this. Who will
drop and retake the lock? How do you make progress
instead of restarting from the beginning?
How do you combine multiple pages in a single s/g?

All these were issues that Wei's patches solved,
granted in a very limited manner (migration-specific)
but OTOH without a lot of tuning.


> What seems to matter is that things are batched.
> There are many ways to achieve that.

True, this is what the patches are trying to achieve.  So far this
approach was the 1st more or less workable way do achieve that,
previous ones got us nowhere.


> -- 
> All rights reversed


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [Qemu-devel] [PATCH v11 4/6] mm: function to offer a page block on the free list
@ 2017-06-21 12:41                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-21 12:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Hildenbrand, Dave Hansen, Wei Wang, linux-kernel,
	qemu-devel, virtualization, kvm, linux-mm, cornelia.huck, akpm,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource,
	Nitesh Narayan Lal

On Tue, Jun 20, 2017 at 03:51:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > > I agree with that.  Let me go into some more detail of
> > > what Nitesh is implementing:
> > > 
> > > 1) In arch_free_page, the being-freed page is added
> > >    to a per-cpu set of freed pages.
> > > 2) Once that set is full, arch_free_pages goes into a
> > >    slow path, which:
> > >    2a) Iterates over the set of freed pages, and
> > >    2b) Checks whether they are still free, and
> > >    2c) Adds the still free pages to a list that is
> > >        to be passed to the hypervisor, to be MADV_FREEd.
> > >    2d) Makes that hypercall.
> > > 
> > > Meanwhile all arch_alloc_pages has to do is make sure it
> > > does not allocate a page while it is currently being
> > > MADV_FREEd on the hypervisor side.
> > > 
> > > The code Wei is working on looks like it could be 
> > > suitable for steps (2c) and (2d) above. Nitesh already
> > > has code for steps 1 through 2b.
> > 
> > So my question is this: Wei posted these numbers for balloon
> > inflation times:
> > inflating 7GB of an 8GB idle guest:
> > 
> > 	1) allocating pages (6.5%)
> > 	2) sending PFNs to host (68.3%)
> > 	3) address translation (6.1%)
> > 	4) madvise (19%)
> > 
> > 	It takes about 4126ms for the inflating process to complete.
> > 
> > It seems that this is an excessive amount of time to stay
> > under a lock. What are your estimates for Nitesh's work?
> 
> That depends on the batch size used for step
> (2c), and is something that we should be able
> to tune for decent performance.

I am not really sure how you intend to do this. Who will
drop and retake the lock? How do you make progress
instead of restarting from the beginning?
How do you combine multiple pages in a single s/g?

All these were issues that Wei's patches solved,
granted in a very limited manner (migration-specific)
but OTOH without a lot of tuning.


> What seems to matter is that things are batched.
> There are many ways to achieve that.

True, this is what the patches are trying to achieve.  So far this
approach was the 1st more or less workable way do achieve that,
previous ones got us nowhere.


> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 19:51               ` Rik van Riel
  (?)
  (?)
@ 2017-06-21 12:41               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-21 12:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: aarcange, amit.shah, kvm, linux-kernel, liliang.opensource,
	qemu-devel, virtualization, linux-mm, Dave Hansen, cornelia.huck,
	pbonzini, akpm, Nitesh Narayan Lal, mgorman

On Tue, Jun 20, 2017 at 03:51:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > > I agree with that.  Let me go into some more detail of
> > > what Nitesh is implementing:
> > > 
> > > 1) In arch_free_page, the being-freed page is added
> > >    to a per-cpu set of freed pages.
> > > 2) Once that set is full, arch_free_pages goes into a
> > >    slow path, which:
> > >    2a) Iterates over the set of freed pages, and
> > >    2b) Checks whether they are still free, and
> > >    2c) Adds the still free pages to a list that is
> > >        to be passed to the hypervisor, to be MADV_FREEd.
> > >    2d) Makes that hypercall.
> > > 
> > > Meanwhile all arch_alloc_pages has to do is make sure it
> > > does not allocate a page while it is currently being
> > > MADV_FREEd on the hypervisor side.
> > > 
> > > The code Wei is working on looks like it could be 
> > > suitable for steps (2c) and (2d) above. Nitesh already
> > > has code for steps 1 through 2b.
> > 
> > So my question is this: Wei posted these numbers for balloon
> > inflation times:
> > inflating 7GB of an 8GB idle guest:
> > 
> > 	1) allocating pages (6.5%)
> > 	2) sending PFNs to host (68.3%)
> > 	3) address translation (6.1%)
> > 	4) madvise (19%)
> > 
> > 	It takes about 4126ms for the inflating process to complete.
> > 
> > It seems that this is an excessive amount of time to stay
> > under a lock. What are your estimates for Nitesh's work?
> 
> That depends on the batch size used for step
> (2c), and is something that we should be able
> to tune for decent performance.

I am not really sure how you intend to do this. Who will
drop and retake the lock? How do you make progress
instead of restarting from the beginning?
How do you combine multiple pages in a single s/g?

All these were issues that Wei's patches solved,
granted in a very limited manner (migration-specific)
but OTOH without a lot of tuning.


> What seems to matter is that things are batched.
> There are many ways to achieve that.

True, this is what the patches are trying to achieve.  So far this
approach was the 1st more or less workable way do achieve that,
previous ones got us nowhere.


> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-20 16:49         ` David Hildenbrand
  (?)
@ 2017-06-21 12:56           ` Christian Borntraeger
  -1 siblings, 0 replies; 175+ messages in thread
From: Christian Borntraeger @ 2017-06-21 12:56 UTC (permalink / raw)
  To: David Hildenbrand, Rik van Riel, Dave Hansen, Wei Wang,
	linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource
  Cc: Nitesh Narayan Lal

On 06/20/2017 06:49 PM, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>
>>> The hypervisor is going to throw away the contents of these pages,
>>> right?  As soon as the spinlock is released, someone can allocate a
>>> page, and put good data in it.  What keeps the hypervisor from
>>> throwing
>>> away good data?
>>
>> That looks like it may be the wrong API, then?
>>
>> We already have hooks called arch_free_page and
>> arch_alloc_page in the VM, which are called when
>> pages are freed, and allocated, respectively.
>>
>> Nitesh Lal (on the CC list) is working on a way
>> to efficiently batch recently freed pages for
>> free page hinting to the hypervisor.
>>
>> If that is done efficiently enough (eg. with
>> MADV_FREE on the hypervisor side for lazy freeing,
>> and lazy later re-use of the pages), do we still
>> need the harder to use batch interface from this
>> patch?
>>
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution, if
> done right. This would avoid the batch interface and even render
> virtio-balloon in some sense useless.
> 
Two reasons why I disagree:
- virtio-balloon is often used for memory hotplug (e.g. libvirt's current/max memory
handling uses the virtio balloon)
- free page hinting will not allow shrinking the page cache of guests (like a balloon does)

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [PATCH v11 4/6] mm: function to offer a page block on the free list
  2017-06-21 12:56           ` Christian Borntraeger
  (?)
@ 2017-06-21 13:47             ` David Hildenbrand
  -1 siblings, 0 replies; 175+ messages in thread
From: David Hildenbrand @ 2017-06-21 13:47 UTC (permalink / raw)
  To: Christian Borntraeger, Rik van Riel, Dave Hansen, Wei Wang,
	linux-kernel, qemu-devel, virtualization, kvm, linux-mm, mst,
	cornelia.huck, akpm, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource
  Cc: Nitesh Narayan Lal

On 21.06.2017 14:56, Christian Borntraeger wrote:
> On 06/20/2017 06:49 PM, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?  As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it.  What keeps the hypervisor from
>>>> throwing
>>>> away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and
>>> arch_alloc_page in the VM, which are called when
>>> pages are freed, and allocated, respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution, if
>> done right. This would avoid the batch interface and even render
>> virtio-balloon in some sense useless.
>>

I said "some sense" for a reason. Mainly because other techniques are
being worked on that are to fill the holes.

> Two reasons why I disagree:
> - virtio-balloon is often used for memory hotplug (e.g. libvirt's current/max memory
> handling uses the virtio balloon)

I know. While one can argue whether this is real unplug, as there are basically
no guarantees (see the virtio-mem RFC), it is used by people because there is
simply no alternative. Still, for now some people use it for that.

> - free page hinting will not allow shrinking the page cache of guests (like a balloon does)

There are currently some projects ongoing that try to avoid the page
cache in the guest completely.


-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-21 12:28         ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-06-22  8:40           ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-06-22  8:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource, riel, nilal

On 06/21/2017 08:28 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 21, 2017 at 11:28:00AM +0800, Wei Wang wrote:
>> On 06/21/2017 12:18 AM, Michael S. Tsirkin wrote:
>>> On Fri, Jun 09, 2017 at 06:41:41PM +0800, Wei Wang wrote:
>>>> -	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
>>>> +	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
>>>>    		virtqueue_kick(vq);
>>>> -		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> -		vb->balloon_page_chunk.chunk_num = 0;
>>>> +		if (busy_wait)
>>>> +			while (!virtqueue_get_buf(vq, &len) &&
>>>> +			       !virtqueue_is_broken(vq))
>>>> +				cpu_relax();
>>>> +		else
>>>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>> This is something I didn't previously notice.
>>> As you always keep a single buffer in flight, you do not
>>> really need indirect at all. Just add all descriptors
>>> in the ring directly, then kick.
>>>
>>> E.g.
>>> 	virtqueue_add_first
>>> 	virtqueue_add_next
>>> 	virtqueue_add_last
>>>
>>> ?
>>>
>>> You also want a flag to avoid allocations but there's no need to do it
>>> per descriptor, set it on vq.
>>>
>> Without using the indirect table, I'm thinking about changing to use
>> the standard sg (i.e. struct scatterlist), instead of vring_desc, so that
>> we don't need to modify or add any new functions of virtqueue_add().
>>
>> In this case, we will kmalloc an array of sgs in probe(), and we can add
>> the sgs one by one to the vq, which won't trigger the allocation of an
>> indirect table inside virtqueue_add(), and then kick when all are added.
>>
>> Best,
>> Wei
> And allocate headers too? This can work. API extensions aren't
> necessarily a bad idea though. The API I suggest above is preferable
> for the simple reason that it can work without INDIRECT flag
> support in hypervisor.

OK, probably we don't need to pass a desc to the vq - the new API can just
fill in the vq's own descriptors directly, like this:

int virtqueue_add_first(struct virtqueue *_vq,
                                      uint64_t addr,
                                      uint32_t len,
                                      bool in,
                                      unsigned int *idx) {

     ...
    uint16_t desc_flags = in ? VRING_DESC_F_NEXT | VRING_DESC_F_WRITE :
                               VRING_DESC_F_NEXT;

     vq->vring.desc[vq->free_head].addr = addr;
     vq->vring.desc[vq->free_head].len = len;
     vq->vring.desc[vq->free_head].flags = cpu_to_virtio16(_vq->vdev, desc_flags);
     /* return to the caller the desc id */
     *idx = vq->free_head;
     ...
}

int virtqueue_add_next(struct virtqueue *_vq,
                                      uint64_t addr,
                                      uint32_t len,
                                      bool in,
                                      bool end,
                                      unsigned int *idx) {
     ...
     vq->vring.desc[*idx].next = vq->free_head;
     vq->vring.desc[vq->free_head].addr = addr;
     ...
     if (end)
         remove the VRING_DESC_F_NEXT flag
}


What do you think? We can also combine the two functions into one.
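(To make the intended usage concrete, here is a sketch of how a caller might
chain descriptors with the helpers proposed above. These helpers exist only
in this discussion; the sketch assumes virtqueue_add_next() also advances
*idx to the descriptor it just filled, assumes at least two chunks, and
get_chunk_addr()/get_chunk_len() are invented placeholders.)

#include <linux/virtio.h>

/* invented helpers returning the address/length of chunk i */
extern uint64_t get_chunk_addr(unsigned int i);
extern uint32_t get_chunk_len(unsigned int i);

static void send_chunks(struct virtqueue *vq, unsigned int nr_chunks)
{
	unsigned int i, head, idx, len;

	virtqueue_add_first(vq, get_chunk_addr(0), get_chunk_len(0),
			    true, &head);
	idx = head;
	for (i = 1; i < nr_chunks; i++)
		virtqueue_add_next(vq, get_chunk_addr(i), get_chunk_len(i),
				   true, i == nr_chunks - 1, &idx);

	virtqueue_kick(vq);
	/* wait for the host to consume the chain, as the driver does today */
	while (!virtqueue_get_buf(vq, &len) && !virtqueue_is_broken(vq))
		cpu_relax();
}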



Best,
Wei

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-22  8:40           ` Wei Wang
  (?)
  (?)
@ 2017-06-28 15:01             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-28 15:01 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource, riel, nilal

On Thu, Jun 22, 2017 at 04:40:39PM +0800, Wei Wang wrote:
> On 06/21/2017 08:28 PM, Michael S. Tsirkin wrote:
> > On Wed, Jun 21, 2017 at 11:28:00AM +0800, Wei Wang wrote:
> > > On 06/21/2017 12:18 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 09, 2017 at 06:41:41PM +0800, Wei Wang wrote:
> > > > > -	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
> > > > > +	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
> > > > >    		virtqueue_kick(vq);
> > > > > -		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > > -		vb->balloon_page_chunk.chunk_num = 0;
> > > > > +		if (busy_wait)
> > > > > +			while (!virtqueue_get_buf(vq, &len) &&
> > > > > +			       !virtqueue_is_broken(vq))
> > > > > +				cpu_relax();
> > > > > +		else
> > > > > +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > This is something I didn't previously notice.
> > > > As you always keep a single buffer in flight, you do not
> > > > really need indirect at all. Just add all descriptors
> > > > in the ring directly, then kick.
> > > > 
> > > > E.g.
> > > > 	virtqueue_add_first
> > > > 	virtqueue_add_next
> > > > 	virtqueue_add_last
> > > > 
> > > > ?
> > > > 
> > > > You also want a flag to avoid allocations but there's no need to do it
> > > > per descriptor, set it on vq.
> > > > 
> > > Without using the indirect table, I'm thinking about changing to use
> > > the standard sg (i.e. struct scatterlist), instead of vring_desc, so that
> > > we don't need to modify or add any new functions of virtqueue_add().
> > > 
> > > In this case, we will kmalloc an array of sgs in probe(), and we can add
> > > the sgs one by one to the vq, which won't trigger the allocation of an
> > > indirect table inside virtqueue_add(), and then kick when all are added.
> > > 
> > > Best,
> > > Wei
> > And allocate headers too? This can work. API extensions aren't
> > necessarily a bad idea though. The API I suggest above is preferable
> > for the simple reason that it can work without INDIRECT flag
> > support in hypervisor.
> 
> OK, probably we don't need to add a desc to the vq - we can just use
> the vq's desc, like this:
> 
> int virtqueue_add_first(struct virtqueue *_vq,
>                                      uint64_t addr,
>                                      uint32_t len,
>                                      bool in,
>                                      unsigned int *idx) {
> 
>     ...
>    uint16_t desc_flags = in ? VRING_DESC_F_NEXT | VRING_DESC_F_WRITE :
>                                              VRING_DESC_F_NEXT;
> 
>     vq->vring.desc[vq->free_head].addr = addr;
>     vq->vring.desc[vq->free_head].len = len;
>     vq->vring.desc[vq->free_head].flags = cpu_to_virtio16(_vq->vdev, flags);
>     /* return to the caller the desc id */
>     *idx = vq->free_head;
>     ...
> }
> 
> int virtqueue_add_next(struct virtqueue *_vq,
>                                      uint64_t addr,
>                                      uint32_t len,
>                                      bool in,
>                                      bool end,
>                                      unsigned int *idx) {
>     ...
>     vq->vring.desc[*idx].next = vq->free_head;
>     vq->vring.desc[vq->free_head].addr = addr;
>     ...
>     if (end)
>         remove the VRING_DESC_F_NEXT flag
> }
> 

And I would say add-last.

> 
> What do you think? We can also combine the two functions into one.
> 
> 
> 
> Best,
> Wei

With an enum? Yes that's also an option.
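(For illustration, the combined single-function variant with an enum could
look along these lines; this is just a sketch of the idea being discussed,
not an existing API:)

enum virtqueue_add_pos {
	VQ_ADD_FIRST,	/* start a new chain, return its head in *idx */
	VQ_ADD_NEXT,	/* link a descriptor after *idx */
	VQ_ADD_LAST,	/* link after *idx and clear VRING_DESC_F_NEXT */
};

int virtqueue_add_desc(struct virtqueue *_vq, uint64_t addr, uint32_t len,
		       bool in, enum virtqueue_add_pos pos, unsigned int *idx);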

-- 
MST

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
@ 2017-06-28 15:01             ` Michael S. Tsirkin
  0 siblings, 0 replies; 175+ messages in thread
From: Michael S. Tsirkin @ 2017-06-28 15:01 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource, riel,
	nilal@redhat.com

On Thu, Jun 22, 2017 at 04:40:39PM +0800, Wei Wang wrote:
> On 06/21/2017 08:28 PM, Michael S. Tsirkin wrote:
> > On Wed, Jun 21, 2017 at 11:28:00AM +0800, Wei Wang wrote:
> > > On 06/21/2017 12:18 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 09, 2017 at 06:41:41PM +0800, Wei Wang wrote:
> > > > > -	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
> > > > > +	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
> > > > >    		virtqueue_kick(vq);
> > > > > -		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > > -		vb->balloon_page_chunk.chunk_num = 0;
> > > > > +		if (busy_wait)
> > > > > +			while (!virtqueue_get_buf(vq, &len) &&
> > > > > +			       !virtqueue_is_broken(vq))
> > > > > +				cpu_relax();
> > > > > +		else
> > > > > +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > This is something I didn't previously notice.
> > > > As you always keep a single buffer in flight, you do not
> > > > really need indirect at all. Just add all descriptors
> > > > in the ring directly, then kick.
> > > > 
> > > > E.g.
> > > > 	virtqueue_add_first
> > > > 	virtqueue_add_next
> > > > 	virtqueue_add_last
> > > > 
> > > > ?
> > > > 
> > > > You also want a flag to avoid allocations but there's no need to do it
> > > > per descriptor, set it on vq.
> > > > 
> > > Without using the indirect table, I'm thinking about changing to use
> > > the standard sg (i.e. struct scatterlist), instead of vring_desc, so that
> > > we don't need to modify or add any new functions of virtqueue_add().
> > > 
> > > In this case, we will kmalloc an array of sgs in probe(), and we can add
> > > the sgs one by one to the vq, which won't trigger the allocation of an
> > > indirect table inside virtqueue_add(), and then kick when all are added.
> > > 
> > > Best,
> > > Wei
> > And allocate headers too? This can work. API extensions aren't
> > necessarily a bad idea though. The API I suggest above is preferable
> > for the simple reason that it can work without INDIRECT flag
> > support in hypervisor.
> 
> OK, probably we don't need to add a desc to the vq - we can just use
> the vq's desc, like this:
> 
> int virtqueue_add_first(struct virtqueue *_vq,
>                                      uint64_t addr,
>                                      uint32_t len,
>                                      bool in,
>                                      unsigned int *idx) {
> 
>     ...
>    uint16_t desc_flags = in ? VRING_DESC_F_NEXT | VRING_DESC_F_WRITE :
>                                              VRING_DESC_F_NEXT;
> 
>     vq->vring.desc[vq->free_head].addr = addr;
>     vq->vring.desc[vq->free_head].len = len;
>     vq->vring.desc[vq->free_head].flags = cpu_to_virtio16(_vq->vdev, flags);
>     /* return to the caller the desc id */
>     *idx = vq->free_head;
>     ...
> }
> 
> int virtqueue_add_next(struct virtqueue *_vq,
>                                      uint64_t addr,
>                                      uint32_t len,
>                                      bool in,
>                                      bool end,
>                                      unsigned int *idx) {
>     ...
>     vq->vring.desc[*idx].next = vq->free_head;
>     vq->vring.desc[vq->free_head].addr = addr;
>     ...
>     if (end)
>         remove the VRING_DESC_F_NEXT flag
> }
> 

Add I would say add-last.

> 
> What do you think? We can also combine the two functions into one.
> 
> 
> 
> Best,
> Wei

With an enum? Yes that's also an option.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-15  8:10       ` Wei Wang
  (?)
  (?)
@ 2017-06-28 15:04         ` Matthew Wilcox
  -1 siblings, 0 replies; 175+ messages in thread
From: Matthew Wilcox @ 2017-06-28 15:04 UTC (permalink / raw)
  To: Wei Wang
  Cc: Michael S. Tsirkin, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, david, dave.hansen, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Jun 15, 2017 at 04:10:17PM +0800, Wei Wang wrote:
> > So you still have a home-grown bitmap. I'd like to know why
> > the xbitmap suggested for this purpose by Matthew Wilcox isn't
> > appropriate. Please add a comment explaining the requirements
> > on the data structure.
> 
> I didn't find his xbitmap being upstreamed, did you?

It doesn't have any users in the tree yet.  Can't add code with no users.
You should be the first!

^ permalink raw reply	[flat|nested] 175+ messages in thread

* Re: [virtio-dev] Re: [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ
  2017-06-28 15:01             ` Michael S. Tsirkin
                                 ` (2 preceding siblings ...)
  (?)
@ 2017-07-12 12:57               ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-07-12 12:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource, riel, nilal

On 06/28/2017 11:01 PM, Michael S. Tsirkin wrote:
> On Thu, Jun 22, 2017 at 04:40:39PM +0800, Wei Wang wrote:
>> On 06/21/2017 08:28 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 21, 2017 at 11:28:00AM +0800, Wei Wang wrote:
>>>> On 06/21/2017 12:18 AM, Michael S. Tsirkin wrote:
>>>>> On Fri, Jun 09, 2017 at 06:41:41PM +0800, Wei Wang wrote:
>>>>>> -	if (!virtqueue_indirect_desc_table_add(vq, desc, num)) {
>>>>>> +	if (!virtqueue_indirect_desc_table_add(vq, desc, *num)) {
>>>>>>     		virtqueue_kick(vq);
>>>>>> -		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>>>> -		vb->balloon_page_chunk.chunk_num = 0;
>>>>>> +		if (busy_wait)
>>>>>> +			while (!virtqueue_get_buf(vq, &len) &&
>>>>>> +			       !virtqueue_is_broken(vq))
>>>>>> +				cpu_relax();
>>>>>> +		else
>>>>>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>>> This is something I didn't previously notice.
>>>>> As you always keep a single buffer in flight, you do not
>>>>> really need indirect at all. Just add all descriptors
>>>>> in the ring directly, then kick.
>>>>>
>>>>> E.g.
>>>>> 	virtqueue_add_first
>>>>> 	virtqueue_add_next
>>>>> 	virtqueue_add_last
>>>>>
>>>>> ?
>>>>>
>>>>> You also want a flag to avoid allocations but there's no need to do it
>>>>> per descriptor, set it on vq.
>>>>>
>>>> Without using the indirect table, I'm thinking about changing to use
>>>> the standard sg (i.e. struct scatterlist), instead of vring_desc, so that
>>>> we don't need to modify or add any new functions of virtqueue_add().
>>>>
>>>> In this case, we will kmalloc an array of sgs in probe(), and we can add
>>>> the sgs one by one to the vq, which won't trigger the allocation of an
>>>> indirect table inside virtqueue_add(), and then kick when all are added.
>>>>
>>>> Best,
>>>> Wei
>>> And allocate headers too? This can work. API extensions aren't
>>> necessarily a bad idea though. The API I suggest above is preferable
>>> for the simple reason that it can work without INDIRECT flag
>>> support in hypervisor.
>> OK, probably we don't need to add a desc to the vq - we can just use
>> the vq's desc, like this:
>>
>> int virtqueue_add_first(struct virtqueue *_vq,
>>                                       uint64_t addr,
>>                                       uint32_t len,
>>                                       bool in,
>>                                       unsigned int *idx) {
>>
>>      ...
>>     uint16_t desc_flags = in ? VRING_DESC_F_NEXT | VRING_DESC_F_WRITE :
>>                                               VRING_DESC_F_NEXT;
>>
>>      vq->vring.desc[vq->free_head].addr = addr;
>>      vq->vring.desc[vq->free_head].len = len;
>>      vq->vring.desc[vq->free_head].flags = cpu_to_virtio16(_vq->vdev, desc_flags);
>>      /* return to the caller the desc id */
>>      *idx = vq->free_head;
>>      ...
>> }
>>
>> int virtqueue_add_next(struct virtqueue *_vq,
>>                                       uint64_t addr,
>>                                       uint32_t len,
>>                                       bool in,
>>                                       bool end,
>>                                       unsigned int *idx) {
>>      ...
>>      vq->vring.desc[*idx].next = vq->free_head;
>>      vq->vring.desc[vq->free_head].addr = addr;
>>      ...
>>      if (end)
>>          remove the VRING_DESC_F_NEXT flag
>> }
>>
> And I would say add-last.
>
>> What do you think? We can also combine the two functions into one.
>>
>>
>>
>> Best,
>> Wei
> With an enum? Yes that's also an option.
>

Thanks for the suggestion. I shifted it a little bit; please have a look at
the latest v12 patches that I just sent out.

Best,
Wei

^ permalink raw reply	[flat|nested] 175+ messages in thread
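
The busy-wait fallback quoted near the top of this exchange only makes sense
next to the normal completion path, where the host's ack arrives through the
virtqueue callback and wakes a sleeping waiter. A rough sketch of how the two
paths pair up is below; balloon_ctx and balloon_kick_sync are made-up names
for illustration, not the driver's actual code.

#include <linux/types.h>
#include <linux/virtio.h>
#include <linux/wait.h>
#include <linux/sched.h>

struct balloon_ctx {
        struct virtqueue *vq;
        wait_queue_head_t acked;
};

/* vq callback registered at probe time: the host has used our buffer. */
static void balloon_ack(struct virtqueue *vq)
{
        struct balloon_ctx *ctx = vq->vdev->priv;

        wake_up(&ctx->acked);
}

/* Kick and wait for completion; busy_wait is for contexts that cannot sleep. */
static void balloon_kick_sync(struct balloon_ctx *ctx, bool busy_wait)
{
        unsigned int len;

        virtqueue_kick(ctx->vq);
        if (busy_wait) {
                while (!virtqueue_get_buf(ctx->vq, &len) &&
                       !virtqueue_is_broken(ctx->vq))
                        cpu_relax();
        } else {
                wait_event(ctx->acked, virtqueue_get_buf(ctx->vq, &len));
        }
}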
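
For the scatterlist alternative Wei describes above - an sg array allocated
once in probe() and queued one entry at a time so that virtqueue_add() never
builds an indirect table, with a single kick at the end - a minimal sketch
could look like the following. It is not the actual v12 code; balloon_chunks
and the chunks_* helpers are illustrative names.

#include <linux/scatterlist.h>
#include <linux/virtio.h>
#include <linux/slab.h>

struct balloon_chunks {
        struct scatterlist *sg; /* allocated once in probe() */
        unsigned int nr;        /* chunks staged so far */
};

static int chunks_init(struct balloon_chunks *c, unsigned int max)
{
        c->sg = kmalloc_array(max, sizeof(*c->sg), GFP_KERNEL);
        if (!c->sg)
                return -ENOMEM;
        c->nr = 0;
        return 0;
}

/* Stage one page range; each array slot is its own one-entry sg list. */
static void chunks_stage(struct balloon_chunks *c, struct page *page,
                         unsigned int len)
{
        sg_init_table(&c->sg[c->nr], 1);
        sg_set_page(&c->sg[c->nr], page, len, 0);
        c->nr++;
}

/* Add every staged chunk as a single-sg buffer, then kick once. */
static int chunks_send(struct virtqueue *vq, struct balloon_chunks *c,
                       void *token)
{
        unsigned int i;
        int err;

        for (i = 0; i < c->nr; i++) {
                /* one sg per add, so no indirect descriptor table is built */
                err = virtqueue_add_outbuf(vq, &c->sg[i], 1, token, GFP_KERNEL);
                if (err)
                        return err;
        }
        virtqueue_kick(vq);
        c->nr = 0;
        return 0;
}

Each virtqueue_add_outbuf() here is its own request, so the device completes
one used buffer per chunk rather than one per batch.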

* Re: [virtio-dev] Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS
  2017-06-28 15:04         ` Matthew Wilcox
                             ` (2 preceding siblings ...)
  (?)
@ 2017-07-12 13:05           ` Wei Wang
  -1 siblings, 0 replies; 175+ messages in thread
From: Wei Wang @ 2017-07-12 13:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael S. Tsirkin, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, david, dave.hansen, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource

Hi Matthew,

On 06/28/2017 11:04 PM, Matthew Wilcox wrote:
> On Thu, Jun 15, 2017 at 04:10:17PM +0800, Wei Wang wrote:
>>> So you still have a home-grown bitmap. I'd like to know why
>>> the xbitmap suggested for this purpose by Matthew Wilcox isn't
>>> appropriate. Please add a comment explaining the requirements
>>> on the data structure.
>> I didn't find his xbitmap being upstreamed, did you?
> It doesn't have any users in the tree yet.  Can't add code with no users.
> You should be the first!

Glad to be the first person eating your tomato. Tastes good :-)
Please have a look at how it's cooked in the latest v12 patches. Thanks.

Best,
Wei

^ permalink raw reply	[flat|nested] 175+ messages in thread

end of thread, other threads:[~2017-07-12 13:05 UTC | newest]

Thread overview: 175+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-09 10:41 [PATCH v11 0/6] Virtio-balloon Enhancement Wei Wang
2017-06-09 10:41 ` [Qemu-devel] " Wei Wang
2017-06-09 10:41 ` Wei Wang
2017-06-09 10:41 ` [PATCH v11 1/6] virtio-balloon: deflate via a page list Wei Wang
2017-06-09 10:41 ` Wei Wang
2017-06-09 10:41   ` [Qemu-devel] " Wei Wang
2017-06-09 10:41   ` Wei Wang
2017-06-09 10:41 ` [PATCH v11 2/6] virtio-balloon: coding format cleanup Wei Wang
2017-06-09 10:41   ` [Qemu-devel] " Wei Wang
2017-06-09 10:41   ` Wei Wang
2017-06-09 10:41 ` Wei Wang
2017-06-09 10:41 ` [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS Wei Wang
2017-06-09 10:41   ` [Qemu-devel] " Wei Wang
2017-06-09 10:41   ` Wei Wang
2017-06-13 17:56   ` Michael S. Tsirkin
2017-06-13 17:56   ` Michael S. Tsirkin
2017-06-13 17:56     ` [Qemu-devel] " Michael S. Tsirkin
2017-06-13 17:56     ` Michael S. Tsirkin
2017-06-13 17:59     ` Dave Hansen
2017-06-13 17:59     ` Dave Hansen
2017-06-13 17:59       ` [Qemu-devel] " Dave Hansen
2017-06-13 17:59       ` Dave Hansen
2017-06-13 18:55       ` Michael S. Tsirkin
2017-06-13 18:55         ` [Qemu-devel] " Michael S. Tsirkin
2017-06-13 18:55         ` Michael S. Tsirkin
2017-06-13 18:55       ` Michael S. Tsirkin
2017-06-15  8:10     ` [virtio-dev] " Wei Wang
2017-06-15  8:10     ` Wei Wang
2017-06-15  8:10       ` [Qemu-devel] " Wei Wang
2017-06-15  8:10       ` Wei Wang
2017-06-16  3:19       ` Michael S. Tsirkin
2017-06-16  3:19       ` Michael S. Tsirkin
2017-06-16  3:19         ` [Qemu-devel] " Michael S. Tsirkin
2017-06-16  3:19         ` Michael S. Tsirkin
2017-06-16  3:19         ` Michael S. Tsirkin
2017-06-28 15:04       ` [virtio-dev] " Matthew Wilcox
2017-06-28 15:04         ` [Qemu-devel] " Matthew Wilcox
2017-06-28 15:04         ` Matthew Wilcox
2017-06-28 15:04         ` Matthew Wilcox
2017-07-12 13:05         ` Wei Wang
2017-07-12 13:05           ` Wei Wang
2017-07-12 13:05           ` [Qemu-devel] " Wei Wang
2017-07-12 13:05           ` Wei Wang
2017-07-12 13:05           ` Wei Wang
2017-07-12 13:05         ` [virtio-dev] " Wei Wang
2017-06-09 10:41 ` Wei Wang
2017-06-09 10:41 ` [PATCH v11 4/6] mm: function to offer a page block on the free list Wei Wang
2017-06-09 10:41 ` Wei Wang
2017-06-09 10:41   ` [Qemu-devel] " Wei Wang
2017-06-09 10:41   ` Wei Wang
2017-06-12 14:10   ` Dave Hansen
2017-06-12 14:10     ` [Qemu-devel] " Dave Hansen
2017-06-12 14:10     ` Dave Hansen
2017-06-12 16:28     ` Michael S. Tsirkin
2017-06-12 16:28       ` [Qemu-devel] " Michael S. Tsirkin
2017-06-12 16:28       ` Michael S. Tsirkin
2017-06-12 16:42       ` Dave Hansen
2017-06-12 16:42       ` Dave Hansen
2017-06-12 16:42         ` [Qemu-devel] " Dave Hansen
2017-06-12 16:42         ` Dave Hansen
2017-06-12 20:34         ` Michael S. Tsirkin
2017-06-12 20:34           ` [Qemu-devel] " Michael S. Tsirkin
2017-06-12 20:34           ` Michael S. Tsirkin
2017-06-12 20:34           ` Michael S. Tsirkin
2017-06-12 20:54           ` Dave Hansen
2017-06-12 20:54             ` [Qemu-devel] " Dave Hansen
2017-06-12 20:54             ` Dave Hansen
2017-06-13  2:56             ` Wei Wang
2017-06-13  2:56               ` [Qemu-devel] " Wei Wang
2017-06-13  2:56               ` Wei Wang
2017-06-13  2:56               ` Wei Wang
2017-06-12 20:54           ` Dave Hansen
2017-06-12 16:28     ` Michael S. Tsirkin
2017-06-20 16:44     ` Rik van Riel
2017-06-20 16:44       ` [Qemu-devel] " Rik van Riel
2017-06-20 16:44       ` Rik van Riel
2017-06-20 16:49       ` David Hildenbrand
2017-06-20 16:49       ` David Hildenbrand
2017-06-20 16:49         ` [Qemu-devel] " David Hildenbrand
2017-06-20 16:49         ` David Hildenbrand
2017-06-20 17:29         ` Rik van Riel
2017-06-20 17:29         ` Rik van Riel
2017-06-20 17:29           ` [Qemu-devel] " Rik van Riel
2017-06-20 17:29           ` Rik van Riel
2017-06-20 18:26           ` Michael S. Tsirkin
2017-06-20 18:26           ` Michael S. Tsirkin
2017-06-20 18:26             ` [Qemu-devel] " Michael S. Tsirkin
2017-06-20 18:26             ` Michael S. Tsirkin
2017-06-20 19:51             ` Rik van Riel
2017-06-20 19:51               ` [Qemu-devel] " Rik van Riel
2017-06-20 19:51               ` Rik van Riel
2017-06-21 12:41               ` Michael S. Tsirkin
2017-06-21 12:41               ` Michael S. Tsirkin
2017-06-21 12:41                 ` [Qemu-devel] " Michael S. Tsirkin
2017-06-21 12:41                 ` Michael S. Tsirkin
2017-06-21 12:41                 ` Michael S. Tsirkin
2017-06-20 19:51             ` Rik van Riel
2017-06-21  8:38           ` [Qemu-devel] " Wei Wang
2017-06-21  8:38             ` Wei Wang
2017-06-21  8:38           ` Wei Wang
2017-06-20 18:17         ` Michael S. Tsirkin
2017-06-20 18:17           ` [Qemu-devel] " Michael S. Tsirkin
2017-06-20 18:17           ` Michael S. Tsirkin
2017-06-20 18:54           ` David Hildenbrand
2017-06-20 18:54             ` [Qemu-devel] " David Hildenbrand
2017-06-20 18:54             ` David Hildenbrand
2017-06-20 18:56             ` Michael S. Tsirkin
2017-06-20 18:56               ` [Qemu-devel] " Michael S. Tsirkin
2017-06-20 18:56               ` Michael S. Tsirkin
2017-06-20 19:01               ` David Hildenbrand
2017-06-20 19:01                 ` [Qemu-devel] " David Hildenbrand
2017-06-20 19:01                 ` David Hildenbrand
2017-06-20 19:01               ` David Hildenbrand
2017-06-20 18:56             ` Michael S. Tsirkin
2017-06-20 18:54           ` David Hildenbrand
2017-06-20 18:17         ` Michael S. Tsirkin
2017-06-21 12:56         ` Christian Borntraeger
2017-06-21 12:56           ` [Qemu-devel] " Christian Borntraeger
2017-06-21 12:56           ` Christian Borntraeger
2017-06-21 13:47           ` David Hildenbrand
2017-06-21 13:47           ` David Hildenbrand
2017-06-21 13:47             ` [Qemu-devel] " David Hildenbrand
2017-06-21 13:47             ` David Hildenbrand
2017-06-20 16:44     ` Rik van Riel
2017-06-12 14:10   ` Dave Hansen
2017-06-09 10:41 ` [PATCH v11 5/6] mm: export symbol of next_zone and first_online_pgdat Wei Wang
2017-06-09 10:41   ` [Qemu-devel] " Wei Wang
2017-06-09 10:41   ` Wei Wang
2017-06-09 10:41 ` Wei Wang
2017-06-09 10:41 ` [PATCH v11 6/6] virtio-balloon: VIRTIO_BALLOON_F_CMD_VQ Wei Wang
2017-06-09 10:41 ` Wei Wang
2017-06-09 10:41   ` [Qemu-devel] " Wei Wang
2017-06-09 10:41   ` Wei Wang
2017-06-12 14:07   ` Dave Hansen
2017-06-12 14:07     ` [Qemu-devel] " Dave Hansen
2017-06-12 14:07     ` Dave Hansen
2017-06-13 10:17     ` Wei Wang
2017-06-13 10:17     ` Wei Wang
2017-06-13 10:17       ` [Qemu-devel] " Wei Wang
2017-06-13 10:17       ` Wei Wang
2017-06-12 14:07   ` Dave Hansen
2017-06-20 16:18   ` Michael S. Tsirkin
2017-06-20 16:18   ` Michael S. Tsirkin
2017-06-20 16:18     ` [Qemu-devel] " Michael S. Tsirkin
2017-06-20 16:18     ` Michael S. Tsirkin
2017-06-21  3:28     ` [virtio-dev] " Wei Wang
2017-06-21  3:28       ` [Qemu-devel] " Wei Wang
2017-06-21  3:28       ` Wei Wang
2017-06-21  3:28       ` Wei Wang
2017-06-21 12:28       ` [virtio-dev] " Michael S. Tsirkin
2017-06-21 12:28         ` [Qemu-devel] " Michael S. Tsirkin
2017-06-21 12:28         ` Michael S. Tsirkin
2017-06-21 12:28         ` Michael S. Tsirkin
2017-06-22  8:40         ` Wei Wang
2017-06-22  8:40           ` [Qemu-devel] " Wei Wang
2017-06-22  8:40           ` Wei Wang
2017-06-22  8:40           ` Wei Wang
2017-06-28 15:01           ` Michael S. Tsirkin
2017-06-28 15:01           ` Michael S. Tsirkin
2017-06-28 15:01             ` [Qemu-devel] " Michael S. Tsirkin
2017-06-28 15:01             ` Michael S. Tsirkin
2017-06-28 15:01             ` Michael S. Tsirkin
2017-07-12 12:57             ` Wei Wang
2017-07-12 12:57             ` Wei Wang
2017-07-12 12:57               ` Wei Wang
2017-07-12 12:57               ` [Qemu-devel] " Wei Wang
2017-07-12 12:57               ` Wei Wang
2017-07-12 12:57               ` Wei Wang
2017-06-22  8:40         ` [virtio-dev] " Wei Wang
2017-06-21  3:28     ` Wei Wang
2017-06-09 11:18 ` [PATCH v11 0/6] Virtio-balloon Enhancement Wang, Wei W
2017-06-09 11:18 ` Wang, Wei W
2017-06-09 11:18   ` [Qemu-devel] " Wang, Wei W
2017-06-09 11:18   ` Wang, Wei W
2017-06-09 11:18   ` Wang, Wei W
